Altered Craft

Weekly Review: Around the Model

Sam Keen — Mon, 22 Jun 2026 12:06:24 GMT

Welcome back to Altered Craft’s weekly AI review for developers. Thanks for reading, and for sticking with us into a quieter week after the frontier-model drama. One pattern runs through this issue: as models get cheaper, more open, and easier to swap, the hard part moves to everything around them. Bayer’s context discipline, two new agent harnesses, and three editorials on review and judgment all point the same way. The edge is in the scaffolding, not the core.

TUTORIALS & CASE STUDIES

This week’s case studies share a thread: the scaffolding and ownership around a model matter more than its raw size. We open with a disciplined enterprise build, then move to running the stack yourself.

How Bayer Built a Reliable Agentic RAG System for Drug Discovery

Estimated read time: 12 min

Bayer’s PRINCE platform grew from keyword search into a multi-agent research assistant for preclinical data. Its reliability comes from context discipline over larger context windows, giving each agent only what it needs, backed by LangGraph orchestration, model fallbacks, and Langfuse observability.

Why this matters: Bigger context windows don’t replace deliberate context engineering. Scope what each agent actually sees, and the whole system becomes easier to steer, debug, and evaluate over time.

Building a Phone-Friendly AI Dev Platform for Your Homelab

Estimated read time: 5 min

From a corporate platform to a personal one, this setup wires OpenCode’s web UI into a homelab so an AI agent can maintain Docker Compose stacks while staying behind PR review with a small blast radius. It runs isolated, pushes feature branches, and humans merge before deploy.

The opportunity: Give your AI agent its own user, an isolated VM, and PR-only access, so it can manage real infrastructure safely, without ever touching production directly.

Local Models Are Actually Good Now

Estimated read time: 5 min

Once you’re running your own infrastructure, the next question is which models fit on it. Continuing our coverage of the shift toward cheaper, smaller models[1] last week, this hands-on account runs agentic coding on a 64GB M2 Mac, where the Gemma 4 family hits roughly 75% of frontier accuracy and speed, with a Pi-plus-LM-Studio setup and Docker sandboxing.

What’s interesting: Local models have crossed a threshold. Agentic coding tasks that were impossible six months ago now work well enough that you can often skip double-checking against an API.

[1] The Coming Shift From Bigger Models to Cheaper Ones

Local AI Is Not Opus: A Founder’s Receipts

Estimated read time: 12 min

That optimism comes with a caveat. A bootstrapped founder shares hard data from running Qwen 27B on a $12,000 GPU. The card pays for itself on support work, but local models are a different tool, not a cheaper Claude, looping once quantized for consumer hardware.

Worth noting: Local models earn their keep on privacy, fixed costs, and vendor risk for scoped, supervised tasks. They still can’t replace frontier models for unattended agentic work.

TOOLS

The tools this week assemble the agent stack from the outside in: a model to drive it, two harnesses to run it, a format to feed it context, and an auth layer to govern it.

GLM-5.2 Brings a Usable 1M-Token Context to Open-Source Coding

Estimated read time: 9 min

Z.ai’s MIT-licensed GLM-5.2 targets long-horizon coding with a 1M-token context that stays reliable under real engineering pressure. It tops open-source models on three benchmarks, adds effort-level controls, and introduces anti-hacking RL to keep training honest.

What this enables: If you run long, messy coding-agent trajectories, GLM-5.2 gives you a frontier-adjacent, openly licensed option where the 1M context actually holds up across hours-long tasks.

eve: Vercel’s Framework for Production-Ready Agents

Estimated read time: 7 min

To put a model like that to work, you need a harness. Vercel’s open-source eve makes an agent a directory of files defining tools, skills, and subagents. It ships durable sessions, sandboxed compute, approvals, multi-channel delivery, tracing, and evals, so you focus on behavior, not plumbing.

Key point: Define your agent as a versioned folder of files, then let the framework handle the durability, sandboxing, approvals, and deployment you would otherwise build by hand.

Flue: A Programmable Harness for Autonomous Agents

Estimated read time: 4 min

The Astro team takes that idea from a different angle. Flue is a TypeScript harness that gives any model the sessions, tools, skills, sandbox, and durable execution for autonomous work. It’s not another SDK but a programmable environment.

The takeaway: If you’re building agents that take real action, Flue handles the sandbox, durability, and observability plumbing, so you can focus on the task instead of the scaffolding.

Google’s Open Knowledge Format: A Standard for Feeding Context to AI Agents

Estimated read time: 8 min

A harness still needs something to feed it. Google Cloud’s Open Knowledge Format is a vendor-neutral spec that formalizes the LLM-wiki pattern into portable markdown bundles with YAML frontmatter. It treats knowledge as a format, not a service, so agents and humans read identical files.

Why now: If your agents keep reassembling the same scattered context from scratch, a portable markdown-and-YAML standard like OKF could make that knowledge readable by every tool you use.

MCP Gets Enterprise-Grade Auth: Authorize Once, Inherit Everywhere

Estimated read time: 4 min

Finally, governing all of this at scale. MCP’s Enterprise-Managed Authorization extension is now stable, letting organizations control server access through their identity provider instead of per-app consent prompts. The IdP becomes the authoritative decision-maker, so users log in once and inherit servers scoped to their roles.

The context: Deploying MCP in an enterprise? EMA lets you centralize access policy and audit in your identity provider, instead of leaving authorization to whatever each employee happened to approve.

NEWS & EDITORIALS

The editorials orbit one question: if agents now write the code, what part of the job stays yours? We open with the week’s biggest deal, then follow the answer through deciding, reviewing, and the discipline that cheap code demands.

SpaceX Buys Cursor for $60B in Stock After Record IPO

Estimated read time: 5 min

We open with the week’s biggest deal. SpaceX agrees to acquire AI coding startup Cursor in a $60 billion all-stock deal days after its record IPO, betting the tool helps its troubled xAI division catch the major labs. The move shows how AI coding became a strategic prize.

What to watch: The tools developers use daily are now acquisition targets in trillion-dollar bets. Watch how ownership shifts affect the roadmaps, pricing, and data practices of tools your team depends on.

Who Decides What: What 400K Claude Code Sessions Reveal About Agentic Work

Estimated read time: 8 min

That raises the question of who is actually steering. Building on our coverage of the decide-execute-deliver argument[2] last week, Anthropic’s analysis of roughly 400,000 Claude Code sessions finds a consistent split: people decide what to build, the agent decides how. Over seven months, debugging nearly halved while task value rose about 25%.

The lesson: Bring deep understanding of your problem to the agent. Domain expertise, not raw coding skill, is what makes the model do more and succeed more often.

[2] The “Decide-Execute-Deliver” Sandwich: Why AI Hasn’t Replaced Engineers

Review Is Now the Most Leveraged Skill in Software

Estimated read time: 14 min

If people own the “what,” verification is where that ownership shows up. Addy Osmani argues that as coding agents accelerate, the bottleneck moved from writing code to verifying it. Citing 2026 data showing 4x output for roughly 12% real value, he shows how review scales with blast radius.

The practice: Scale your human review to the blast radius of the change, and pair two deliberately different AI reviewers rather than chasing a single best tool.

When Code Becomes Disposable, Discipline Becomes the Product

Estimated read time: 15 min

Which points to a deeper shift. The piece argues cheap, regenerable AI code turns code into a materialized view of understanding, disposable when stale. Drawing on Phoenix Architecture ideas, it contends disposability raises the bar for rigor, pushing teams toward observability, characterization tests, and evals in production.

The shift: Stop treating code as the durable asset. Invest in evals, observability, and short feedback loops so the understanding lives where it can be regenerated on demand.

That’s the week. As the model becomes the easy part, your edge is everything you build around it: the context you curate, the harness you run it in, and the review that catches what it gets wrong. See you next Monday.

Tilth gets a Flight Recorder

Sam Keen — Fri, 19 Jun 2026 18:22:48 GMT

A quick progress update on Tilth, the small agent harness I introduced a while back for running open-weights models against real coding tasks. Most of the recent work has gone into one thing: being able to see what a session is doing, while it runs and after it finishes.

This is in the form of a web app composed of a dashboard and a visualized stream of the activity logs for a session.
Here we see the dashboard section for a finished session:

Four tasks, start to finish, in about six and a half minutes and 285k tokens. The reason I keep coming back to this view is that it answers a question the exit code can’t. A run that finishes and a run that finishes well look identical if all you have is all_done. The difference is in the shape.

A few patterns I read off it now without thinking:

Iteration counts spread evenly and well under the cap: a healthy run. One task pinned near the ceiling is one that got stuck.
Mostly accepts, few rejects: the worker and the evaluator agree. A wall of rejects means they are talking past each other.
Context pressure that climbs through a task and resets at each boundary: the out-of-context memory doing its job. A line that never resets is context quietly bloating.

Here is a fresh one kicking off: a path, a model, and the loop starts turning.

The model and harness activity stream live in the web view, filling in event by event.

And I can drop into any single iteration and read it like a transcript: the evaluator’s verdict with its reasoning, the ledger entry, the commit.

None of this is novel as “observability.” There is a lot of good writing on agent tracing right now. What I’d flag is smaller and more practical. I built all of this for myself, to debug my own runs, and the live view came almost for free, because every panel is a render of the same append-only event log. Build the log for the autopsy and the live dashboard falls out of it.

Why I care about this particular instrument: I am trying to work out how far cheaper open-weights models can be trusted with autonomous work. You can’t answer that from a pass or fail. You have to watch how the model gets there, and notice when it thrashes. This is the thing that makes that legible.

That is the progress. It is the instrument, not the study. The study is the next step: I want to batch-evaluate a finished session and score it on effectiveness and efficiency, so the shape I am reading by eye becomes something gradable.

The rest of the details are in the Tilth docs.

Weekly Review: Pulled Offline

Sam Keen — Mon, 15 Jun 2026 12:13:32 GMT

Welcome back to Altered Craft’s weekly AI review for developers, and thank you for being here for a heavy news week. The defining story is sobering: a US government directive forced Anthropic to pull Fable 5 and Mythos 5 offline for every customer, the first time a nation has recalled a deployed frontier model. A model you build on can now vanish overnight. So the rest of this issue leans into the hedge: build from first principles, route to smaller models, own more of your stack.

TUTORIALS & CASE STUDIES

We start at the foundations and climb: one neuron by hand, a full from-scratch curriculum, a tiny model trained for $80, then serving it efficiently and prompting the latest frontier release.

The Perceptron, Built From Scratch in Your Browser

Estimated read time: 8 min

Ranpara builds a single perceptron in plain Python and runs it live so you watch it learn. A student-pass example shows why the bias moves the boundary to where the answer lives, plus how normalization keeps training smooth.

The takeaway: Understand the weight, bias, and update loop of a single neuron and you understand the building block every neural network is made from.

503 Lessons That Build AI From Raw Math Up

Estimated read time: 9 min

From a single neuron to the full stack, this free, open-source curriculum spans 20 phases and 503 lessons across four languages. Its core principle: build every algorithm from raw math before touching a framework. Every lesson ships a reusable artifact, a prompt, skill, agent, or MCP server.

Why this matters: If you can call AI APIs but can’t explain what happens underneath, this curriculum closes that gap by making you build each piece by hand.

Building a Victorian LLM From (Almost) Scratch for $80

Estimated read time: 31 min

Putting that build-it-yourself ethos to the test, Cristi Constantin trains a 340M-parameter, Llama-based model knowledge-locked to the year 1900. The standout lesson: data quality, not compute, dominates the work, through custom de-duplication, compression ratios, entropy, and OCR scoring. Total GPU cost: roughly $80.

Key point: To truly understand how LLMs work, build a small one yourself, and expect most of your time to go into cleaning and filtering the data.

Serving LLMs Efficiently: A Hands-On vLLM Workflow

Watch time: 1h38m

Once a model exists, the next problem is serving it. This DeepLearning.AI course, built with Red Hat, shows that efficient LLM serving is mostly a memory problem, where weights and the KV cache compete for GPU space. You quantize a Qwen model, serve it with vLLM, and benchmark under load.

What this enables: Treat LLM deployment as a memory-management problem, then use quantization with vLLM’s PagedAttention and continuous batching to balance speed, cost, and accuracy.

Prompting Patterns for Claude Fable 5: What Changes When the Model Runs for Hours

Estimated read time: 7 min

At the frontier end of the spectrum, and newly poignant this week, Anthropic’s guide details the prompting shifts for Claude Fable 5 and Mythos 5, the very models just pulled offline. Because instruction-following is now strong enough to steer briefly, older prescriptive prompts can degrade output and need pruning before migration.

Try this: Re-audit your prompts and skills before migrating, since instructions tuned for older models can actively hurt Fable 5’s output. Call it homework for whenever it comes back. If it comes back.

TOOLS

This week’s tools equip the agent itself: the new frontier model to drive it, a brain and a sandbox to work in, an interface built for it, and a guardrail to keep it safe.

Serena: Giving Your Coding Agent an IDE’s Brain

Estimated read time: 8 min

To make any model useful in a large codebase, Serena is an open-source MCP server that gives coding agents semantic, symbol-level understanding of code instead of brittle text search. Backed by language servers across 40+ languages, it turns cross-file renames, lookups, and refactors into single atomic calls.

The opportunity: If your agent fumbles refactors with search-and-replace in a large codebase, an MCP server like Serena can collapse eight to twelve error-prone steps into one reliable operation.

Every Agent Needs Its Own Computer

Estimated read time: 6 min

If a brain handles the code, the next need is a safe place to run it. LangChain introduces LangSmith Sandboxes, hardware-virtualized microVMs that give each agent a disposable computer with filesystem, shell, and persistent state. The piece argues containers aren’t an isolation boundary for untrusted model-generated code.

The principle: If your agent runs dynamic or model-generated code, give it a hardware-isolated sandbox rather than a shared-kernel container you are quietly trusting.

Designing the hf CLI for Humans and Agents Alike

Estimated read time: 9 min

Tools also have to speak the agent’s language. Hugging Face rebuilt its hf CLI to serve both humans and coding agents, auto-detecting which is driving and rendering output accordingly. Benchmarks show a no-CLI baseline burns up to 6x the tokens on complex multi-step tasks.

Worth noting: Designing tool output for agents, compact, parseable, and full of next-command hints, can cut token usage dramatically without hurting the human experience.

Nemotron 3.5 Brings Custom Policy Reasoning to Multimodal Safety

Estimated read time: 9 min

Finally, a guardrail for everything above. NVIDIA’s 4B-parameter safety model unifies multimodal input, 12-language coverage, and custom enterprise policy enforcement in one inference call. The standout is auditable reasoning traces via THINK mode, letting teams enforce natural-language policies with a documented justification per verdict.

What this enables: If you build guardrails for regulated or multi-domain AI products, one model that accepts custom policies and emits auditable traces could replace a stack of brittle classifiers.

NEWS & EDITORIALS

The editorials open on the week’s biggest shock, a frontier pulled offline by government order, then widen out: the architecture you actually control, clearer thinking about “world models,” the economics pushing toward cheaper ones, and the part of the job that stays yours.

The Government Just Pulled Two Anthropic Models Off the Market

Estimated read time: 4 min

In an action without precedent, a June 12 US export-control directive ordered Anthropic to suspend Fable 5 and Mythos 5 for any foreign national, and the company disabled both models for everyone worldwide. Anthropic disputes the cited jailbreak, reading a codebase to fix its flaws, as narrow and reproducible in rival models, and says it is working to restore access.

Worth watching: Frontier access is something regulation can revoke overnight. Keep a fallback model and own more of your stack.

The AI Agents Stack: 2026 Edition

Estimated read time: 9 min

If a model can vanish overnight, the architecture you control matters more. Paolo Perrone redraws Letta’s influential agent stack diagram for 2026, mapping six layers from models to guardrails. The recurring argument: add complexity only when something specific breaks, with honest takes on lock-in and the demo-to-production gap at each layer.

The pattern: Start with the simplest stack that works and add layers only when something specific breaks, not in anticipation of problems you don’t yet have.

A Functional Taxonomy of World Models

Estimated read time: 9 min

Another effort to bring order to a loaded term, and World Labs sort “world model” systems into three functions: renderers output pixels, simulators output state, planners output actions. They argue simulation is the structural backbone from which fidelity and reliable action derive.

The lens: When evaluating a “world model,” ask which contract it fulfills, visual plausibility, structural accuracy, or correct action, because beautiful pixels rarely guarantee usable physics.

The Coming Shift From Bigger Models to Cheaper Ones

Estimated read time: 5 min

Building on our coverage of open versus closed model economics[1] last week, the question sharpens as subsidies slow and token prices climb: are companies ready to switch to smaller models? Early tests, like Harvey cutting inference costs 3x with no quality loss, suggest the real divide is large versus small.

Why now: Stop defaulting to the frontier model for everything and start routing each task to the smallest model that returns the right answer.

[1] Open and Closed Models Are on Different Exponentials

The “Decide-Execute-Deliver” Sandwich: Why AI Hasn’t Replaced Engineers

Estimated read time: 9 min

We close on what stays yours. Following our look at how the senior role is moving from writing code to directing it[2] last week, this essay examines headline layoffs at Block, Snap, and Intuit and finds AI is mostly a scapegoat for financial restructuring. Software work is a decide-execute-deliver sandwich: AI compresses the middle, but deciding what to build and owning delivery resist automation.

The shift: Code generation was never the bottleneck. Your value increasingly lives in framing problems and owning what ships, not in typing the implementation.

[2] When AI Starts Building AI: Inside Anthropic’s Self-Improvement Curve

Weekly Review: The Outer Loop

Sam Keen — Mon, 08 Jun 2026 12:10:10 GMT

Altered Craft

Welcome back to Altered Craft’s weekly AI review for developers, and thank you for starting your week with us. Last week we followed the money; this week we follow the loop. Agents are reaching deeper into their own cycle of writing and checking, from self-validating guardrails to workflows that fan across hundreds of subagents, with Claude now authoring most of its own merged code. As that inner loop closes, the work that stays yours moves outward: setting the guardrails, directing the fleet, and choosing which problems matter.

TUTORIALS & CASE STUDIES

We open with the principle that pushes you out of the inner loop, then turn to two production realities: keeping RAG costs in check and running frontier models inside your own cloud perimeter.

Backpressure Is All You Need: Stop Being Your Agent’s Bottleneck

Estimated read time: 11 min

Building on our coverage of Anthropic’s point that verification is now the bottleneck[1], Lucas da Costa argues humans have become the default backpressure in AI coding loops, shuttling feedback between agents and bots. His fix: build automated guardrails (tests, types, benchmarks, review agents) that force the LLM to validate its own work first.

The takeaway: If you are manually correcting your coding agent every cycle, you are not delegating, you are an expensive clipboard between two machines.

[1] Using LLMs to Secure Source Code: A Six-Step Loop

RAG Is Burning Money: Building a Cost Control Layer

Estimated read time: 14 min

Once the agent is checking its own work, the next thing to watch is the bill. A working RAG pipeline can drain budgets through over-fetched context, uncached repeats, and oversized models on trivial queries. This walkthrough builds a four-layer cost control system with semantic caching, routing, and token budgeting, achieving up to 85.8% cost reduction.

Worth noting: Treat RAG cost as a separate failure domain from quality, and instrument routing and caching before scaling traffic.

Running OpenAI Models on Amazon Bedrock: A Production Cookbook

Estimated read time: 12 min

Staying with production concerns, OpenAI’s cookbook walks through running GPT-5 models on Amazon Bedrock via an OpenAI-compatible Responses API, using a fictional retailer support workflow to cover structured outputs, function tools, PDF inputs, prompt caching, background mode, stateful conversations, and operational smoke checks.

What this enables: If you need OpenAI models inside your AWS perimeter, Bedrock now exposes the full Responses API surface with minimal code changes from a standard OpenAI SDK setup.

TOOLS

The model releases cluster by job this week: open-weight coding and multimodal models, an orchestrator built for long-running agents, the workflow layer that fans work across hundreds of subagents, and a scanner to keep the skills they use safe.

MiniMax M3: An Open-Weight Model Targeting Frontier Coding and Agentic Work

Estimated read time: 11 min

MiniMax releases M3, an open-weight model with 1M context and frontier coding scores powered by a new sparse attention architecture called MSA. Tests include a 24-hour autonomous CUDA kernel run that lifted Hopper FP8 utilization from 7.6% to 71.3%.

The signal: Long-horizon agentic coding is increasingly bottlenecked by attention architecture, not model size, and open-weight options are starting to close the gap with closed frontier models.

Microsoft’s MAI-Code-1-Flash: Trained Inside the Copilot Harness

Estimated read time: 3 min

Another approach to coding models, Microsoft’s MAI-Code-1-Flash was trained directly inside GitHub Copilot’s production harness rather than tuned for benchmarks. Adaptive solution length control reportedly cuts token use by up to 60% on harder problems while outperforming Claude Haiku 4.5 across four SWE-Bench evaluations.

Why this matters: Coding models trained inside the actual production harness tend to translate offline gains into real developer workflows more reliably than benchmark-optimized alternatives.

Gemma 4 12B Ships with Encoder-Free Multimodal Architecture

Estimated read time: 7 min

Shifting from coding to multimodal, Google releases Gemma 4 12B, a dense model with an encoder-free architecture that feeds vision and audio straight into the LLM backbone. It runs on 16GB VRAM, supports native audio, and ships with macOS apps and a local OpenAI-compatible server.

The opportunity: Developers can now run a single multimodal model locally that handles text, vision, and audio without juggling separate encoders or fine-tuning pipelines.

NVIDIA Nemotron 3 Ultra: Built for Long-Running Agents

Estimated read time: 9 min

Moving from models to orchestration, NVIDIA releases Nemotron 3 Ultra, a 550B-parameter MoE model designed for frontier reasoning and orchestration in agentic systems. It claims 5x higher throughput than peers and up to 30% lower cost per agent task, shipping fully open under OpenMDW-1.1.

The pattern: If you’re building multi-turn agents, pairing a frontier orchestrator with efficient execution models is becoming the dominant pattern, and Nemotron 3 Ultra is now a credible open option for the orchestrator role.

Claude Code Gets Dynamic Workflows: Hundreds of Parallel Subagents in One Session

Estimated read time: 5 min

Taking orchestration further, Anthropic’s 4.8 release brings dynamic workflows to Claude Code, where Claude writes orchestration scripts fanning work across hundreds of parallel subagents with adversarial verification. The lead example: Jarred Sumner’s port of Bun from Zig to Rust, 750,000 lines, eleven days.

Key point: When a task is too big for a single agent pass, dynamic workflows let Claude plan, parallelize, and self-verify across hundreds of subagents, but expect significantly higher token usage.

SkillSpector: NVIDIA’s Security Scanner for AI Agent Skills

Estimated read time: 9 min

To keep all this safe, NVIDIA’s open-source SkillSpector vets AI agent skills for Claude Code, Codex CLI, and Gemini CLI, combining static analysis with optional LLM evaluation across 64 vulnerability patterns in 16 categories. It produces a 0-100 risk score and SARIF output for CI/CD.

The principle: Treat third-party agent skills like any other untrusted dependency, and scan them before installation rather than granting implicit trust.

NEWS & EDITORIALS

The editorials trace where the human role is heading: from typing to describing, from writing code to directing it, then the market structure underneath, and a closing reminder to protect the thinking the agents can’t do for you.

The Speed of Prototyping in the Age of AI

Estimated read time: 7 min

Daryl Cecile reflects on a rough 4x speedup from AI agents, but argues the more important change is the shift toward describing systems before building them, which sharpens delegation and expands what work he can realistically take on.

What’s interesting: When agents handle the typing, the engineering skill that matters most is clearly describing what success looks like, so deliberately protect time for hands-on work to keep your instincts sharp.

Running an AI-Native Engineering Org: Lessons from the Claude Code Team

Estimated read time: 9 min

In more news on how the work itself is changing, Anthropic’s Claude Code team lead shares how agentic coding reshaped their workflows, from six-month roadmaps to just-in-time planning. The piece covers shifts in code review, context gathering, and team makeup as bottlenecks move from writing code to verifying it.

Try this: Pick your noisiest engineering workflow and ask whether it still serves its purpose, or whether it can be automated or dropped entirely.

When AI Starts Building AI: Inside Anthropic’s Self-Improvement Curve

Estimated read time: 14 min

Following our look at why Claude is not your architect[2] from last week, Anthropic shares internal data: Claude authors 80%+ of merged code, and engineers ship 8x more per quarter than in 2024. The piece traces the path toward recursive self-improvement, while noting Claude still struggles to choose which problems matter.

The shift: The role of a senior engineer is moving from writing code to directing, reviewing, and choosing which problems are worth solving at all.

[2] Claude Is Not Your Architect

Open and Closed Models Are on Different Exponentials

Estimated read time: 8 min

Building on our coverage of Max Trivedi’s case that open models cap frontier pricing[3] from last week, Nathan Lambert argues open and closed AI models run on different economic exponentials. Closed labs capture premium margins through integrated coding agents, while open models diffuse across enterprises at commodity pricing. Both ecosystems grow, but their timelines look nothing alike.

What this means: If your work depends on coding agents, expect to keep paying premium prices for frontier closed models while open models quietly win the long tail of enterprise deployments.

[3] When Outsourcing + Local AI Undercuts Frontier Labs

Anthropic Files Confidential S-1 for Proposed IPO

Estimated read time: 1 min

On the business side, Anthropic has confidentially submitted a draft S-1 to the SEC, opening the option to go public pending review. The filing follows a $65B Series H at a $965B post-money valuation, signaling a potential shift toward public markets.

Worth watching: If Anthropic goes public, the tools developers depend on start answering to quarterly earnings cycles, so watch how that shapes Claude’s roadmap and pricing.

The AI Treadmill: When Staying Current Becomes Its Own Trap

Estimated read time: 8 min

Closing on the human cost, Deb Liu reflects on the quiet panic beneath San Francisco’s AI enthusiasm, where even experts feel perpetually behind. She argues efficiency gains often just refill themselves with more motion, and protecting space to think matters more than chasing every tool.

Key point: Going deep on one tool that changes how you work beats sampling ten you’ll abandon, and the pauses you protect are where your best thinking happens.

That’s the week. The throughline: as agents close the inner loop of building and verifying, your leverage moves to the outer one, the guardrails you set, the fleets you direct, and the problems you choose to point them at. See you next Monday.

The Fallacies of AI-Assisted Programming

Sam Keen — Wed, 03 Jun 2026 12:26:52 GMT

In April 2026, an autonomous coding agent hit a credential mismatch, decided that deleting the database would clear it, and wiped production in nine seconds. The backups went with it. They lived in the same volume the agent deleted. The recovery it chose was the disaster.

The failure was new. The shape of it was not.

In 1994, a researcher at Sun named Peter Deutsch wrote down eight assumptions that distributed-systems engineers were silently making. The network is reliable. Latency is zero. Bandwidth is infinite. The network is secure. Topology doesn’t change. There is one administrator. Transport cost is zero. The network is homogeneous. The list was short. The list was specific. And the list was uncomfortable, because many engineers reading it had violated several of them at any given time.

Three decades later, those eight sentences are still load-bearing in every distributed-systems postmortem worth reading. APNIC ran a 21-years-later retrospective in December 2025 confirming what every on-call engineer already knew. The fallacies did not get solved. They got priced.

We are doing it again, at a higher abstraction layer.

The trade distributed systems made, twenty years ago

Distributed systems made a trade twenty years ago. More defects, faster recovery, net win. Once the network became the substrate, you stopped trying to prevent every failure and started engineering for mean time to recovery instead of mean time between failures. AI-assisted programming is making the same trade, a layer up, by handing authorship to something non-deterministic. Humans were never deterministic either, but the new author is jagged to a higher degree: it fills gaps without flagging them, lacks the common sense to seek out what it hasn’t been shown, and (where it controls the workflow) substitutes reasoning for the static orchestration you used to write yourself. The fallacies that fall out of that (non-determinism at the output, verification at the review, drift at the model, blast radius at the agent) are the same shape as Deutsch’s network fallacies, and they yield to the same playbook. Name them. Price them. Engineer the recovery. That is the move that earned the SRE discipline twenty years of leverage. It is the move that earns AI-assisted programming the same kind of leverage, on a faster clock.

The fallacies did not get solved. They got priced.

These eight are not a one-to-one re-map of Deutsch’s list. They are the same shape: assumptions that were safe in the old model and quietly false in the new one. I have sequenced them in the order most engineering teams hit them in practice, not in Deutsch’s canonical order. The thread that ties them together is the Metaphorex framing that holds the whole Deutsch literature up: “each fallacy names a specific locality assumption that engineers unconsciously import from single-machine programming.” That is the move worth borrowing. The assumptions we are importing all come from a world where the author was a single, minimally jagged, accountable human whose knowledge had edges it knew about. The new author breaks every part of that.

The model output can be made deterministic

Setting temperature=0 does not make a large language model deterministic. It is among the most common misconceptions in production LLM engineering. Tianpan documents it at scale on a Qwen3-235B setup: same prompt, same model, temperature=0, one thousand runs, 80 distinct completions. The mechanisms sit below the application layer, in how the inference hardware schedules and batches the math. No model change, no prompt change, just parallel hardware doing what it does at scale.

The labs are working on it. Microsoft Research’s LLM-42 is a serious attempt at scheduling-based determinism, so this is not unsolvable in principle. It is unsolved in practice. In every production system running today, “same input, same output” is silently violated, and the plans built on top of that assumption are spending a budget no one costed.

The cost of code has gone to zero

When agents were building brochure sites, the cost felt that way. Now that real work is in scope, two things shifted. Agents burn more tokens just on the happy path through production complexity. They also chase far more dead ends before they get there.

The free-lunch framing collapsed in early 2026. GitHub Copilot moved to AI Credit billing. OpenAI shifted Codex to token pricing. Cursor and Windsurf raised their Pro tiers. The vendor framing of “unlimited AI coding” got quietly retired, and LeadDev called it.

The numbers that land are the runaway ones. A single Cursor user burned $4,200 in API fees over a weekend during an autonomous refactor. Two LangChain agents in an infinite conversation for eleven days landed a $47,000 bill. These read as edge cases until you notice they sit on the tail of a distribution every team is now exposed to.

The cost-runaway mechanism is structural. Agents re-send accumulated context on every step. A five-step loop costs 3.2× a single chatbot call. A fifty-step loop, 30×. A two-hundred-step debugging session, 100×. Average agentic developer spend in 2026 is now $400–$1,500 per month. Bryan Catanzaro, Nvidia’s VP of applied deep learning, said the quiet part out loud: “for my team, the cost of compute is far beyond the costs of the employees.” Anthropic, for its part, blocked Claude Pro/Max subscribers from running third-party agent frameworks because flat-rate pricing did not survive contact with autonomous loops.

Per-token cost has fallen 280× in two years, which is the number vendors quote. The number they do not quote is that the recent move at the frontier is up: SignalBloom’s pricing analysis tracks GPT-5.5 roughly tripling over GPT-5, and Opus 4.7 consuming 32–47% more tokens per task than its predecessor. Enterprise AI spend rose 320% over the same two years, because usage exploded faster than the headline price fell, and inference is now 85% of enterprise AI budget, up from ~40% in 2023. The economics did not get better. They got more elastic, and the elasticity went the wrong way. The token bill is one face of it. The reviewer’s calendar is the other, which is the next fallacy.

Verification keeps pace

This is the fallacy that turned vendor productivity claims into a math problem. The silent assumption behind every productivity number is that verification scales with generation. It does not. AI generates code in seconds. Engineers still review it at the same pace they always have. Faros telemetry across thousands of teams shows the asymmetry, and the shape of it is simple: the PRs got bigger and buggier while the clock to review them stretched. Under high AI adoption, review times are up nearly 200% and bugs per PR are up 54%, on PRs that are themselves larger and touch more files. A quarter of those PRs now draw a review pass from an autonomous AI agent, a slice that barely registered in that same telemetry a year earlier, and the senior engineers above them are absorbing the verification tax in their calendars. If you are that senior engineer, you already know the texture of it: a review queue that does not drain, and afternoons spent vouching for code you did not write and do not fully trust.

The author got faster. The check on the author did not.

AI-generated code is harder to review than bad human code because it fails in ways that look like competence. Jake Redmond’s framing is the cleanest version of this I have read in 2026:

AI agents do not pause when requirements are vague. They do not challenge undefined behavior. They fill the gap and compile the guess.

That gap-filling is the work that used to happen out loud, between humans, before code got written. Half the quiet value of a standup was someone asking “wait, what happens when that field is empty” before there was any code to be wrong about. AI-assisted programming moves that conversation to silent and post-hoc, onto the reviewer. The author got faster. The check on the author did not.

SonarSource’s 2026 survey puts a number on the gap. 72% of developers use AI daily. 96% don’t fully trust the output. Only 48% consistently verify. That 48-point spread between “don’t trust” and “verify” is the structural debt AI-assisted programming is accumulating in 2026. METR’s 19% real slowdown against 20% perceived speedup is what that gap looks like from the inside, and it is the cleanest available “perception gap” data.

The temptation is to skip verification by asking the model to handle it. Tell it to make the code secure. Tell it to follow OWASP. Tell it to check the edge cases. That is the move the next fallacy is about. It does not work.

Telling the model to ‘make it secure’ works

The silent move is to add “and make it secure” or “follow OWASP” to the prompt and feel covered. The empirical result is that the generic instruction does not work.

Veracode’s 2025 GenAI Code Security Report ran more than 100 LLMs across 80 coding tasks. 45% of AI-generated code samples failed security tests against OWASP Top 10. XSS tasks were insecure in 86% of cases. Java came in worst at 72% failure. The result that catches teams off guard is that the rate is flat across model generations. Newer and bigger models do not produce meaningfully more secure code.

Explicit security prompting closes very little of that gap. An empirical evaluation across five LLMs and four languages found that security-focused prompting strategies, including weaknesses-aware chain-of-thought, did not statistically reduce vulnerability frequency or density. The prompts shifted which CWEs showed up. They did not reduce the total.

The reason is not that the model is blind to your context. It is that a generic instruction carries none. “Make it secure” tells the model to pattern-match on what “secure” tends to look like in its training data, and it will. Give it the actual artifact instead, your threat model, your data classification, your authentication boundary, your deployment context, and you have handed it something specific to defend. The failure mode is the stylistic cue standing in for the artifact. “Secure” as a bare adjective collapses a contextual judgment into a vibe, and produces code that looks defended without defending the right things.

The labs have noticed the same gap, and they are loud about it. Anthropic’s Project Glasswing put a security-tuned model, Mythos, in front of partner orgs to scan codebases, claiming thousands of zero-days surfaced. OpenAI’s Daybreak goes at the artifact directly: it ingests a repository, builds a codebase-specific threat model, and maps attack paths against it. The direction is real. The claims are running ahead of the results. When Mythos was run against curl, five “confirmed vulnerabilities” triaged down to one low-severity bug, the rest already-documented behavior or a plain bug, and maintainer Daniel Stenberg called the model “an amazingly successful marketing stunt.” Note too what these tools do: they find flaws and model threats. None of it is evidence that generation got more secure. That 45% has not moved while all of this shipped.

The perception gap is the part that is hardest to swallow. The Stanford study by Perry et al. ran the cleanest experiment on this question. Participants with access to an AI coding assistant wrote significantly less secure code than those without, and were more likely to believe their code was secure. The two findings compound. Confidence rose while ground truth fell.

So the threat model can be written down, and the tooling is starting to help you write it. What does not transfer is the judgment underneath it. Deciding what could actually go wrong in this application, against this specific surface, is still the engineer’s call. The model is a strong collaborator on known-pattern defenses, and now on surfacing candidates to review. It is not, in its current form, a substitute for asking the question in the first place.

The context window is endless

The long-context arms race did not win. Liu et al.’s “Lost in the Middle” finding from ICML 2024 (25–40% recall drop on facts buried mid-prompt) narrowed in frontier 2026 models but did not close. Chroma’s 2025 study of 18 frontier models shows 30%+ accuracy drops for information in the middle of long conversations. Quality starts degrading at 60–70% of the rated window, not at 100%. The rated number is not the effective number.

The tooling has already internalized this. Claude Code’s own context management advises compression in the 100–200k token range, well shy of the 1M rated window. The vendor knows the cliff is there.

The concrete result that engineers should sit with: one team cut their window from 2M to 64k tokens, added a repo graph, and bug-fix accuracy went from 71% to 84%, with 5× lower cost. Bigger context stopped helping. Memory architecture is the new compiler.

The fallacy of equating “the window” with “the context” is the one practitioners have figured out fastest. The 2026 frontier is no longer the token count. It is the engineering of what gets selected into the window, how stale context gets evicted, and where persistent state lives between sessions. That work already has a name. It got one in mid-2025, when Andrej Karpathy and others started calling it context engineering, “the delicate art and science of filling the context window with just the right information for the next step.” This is the one fallacy that has already been named and turned into a visible discipline. The other seven are not there yet.

All models are interchangeable

Swap one frontier model for another and the behavior changes underneath you, even when the leaderboard scores sit within a few points. The differences are tendencies, not rankings. GPT-5.5 emits ~72% fewer output tokens than Opus 4.7 on identical agentic loops in Mindstudio testing, a 72% efficiency gap on the same workload that compounds across every iteration of a loop. Tool error rates differ by model, and even across versions of one family. Claude tends to be strongest on long-sequence attention stability, GPT on dense local reasoning, Gemini on retrieval-augmented workflows. None of that makes them specialists. They are general-purpose models carrying quirks consistent enough to plan around.

The economics are the other half, and they break the reflex to always reach for the most capable model. On a blended agentic workload, heavy on input and light on output, SignalBloom’s cost analysis puts DeepSeek around $0.094 per million tokens against roughly $2.80 for OpenAI and Anthropic. That is close to a 30× spread for work that does not always need a frontier model to be done well. “Always use the best model” is a fine default until it meets a budget, and at scale it always meets a budget.

Model choice is now a routing decision, not a vendor decision. As Danilchenko frames it:

Pre-April, “use the best model” meant Claude. Post-April, it means “decide whether you’re paying for raw quality, agentic efficiency, or context-window scale,” which gives three different answers.

Teams that treat models as drop-in replacements for each other are the ones who get surprised when a sub-agent that worked beautifully on Opus produces tool-call cascades on Gemini.

The mitigation lever that has emerged is evals. Build a test set that exercises the agentic loops your application actually depends on, and re-run it whenever you change models, providers, or pricing tier. Evals let you see the delta the leaderboard hides. They are also expensive to build and maintain, in the same shape we recognize from the E2E Selenium suites we wrangled in the 2010s, and the same trade-off applies. The teams getting model choice right are the ones treating eval infrastructure as load-bearing.

The model is stable

In April 2026, Anthropic removed the ability to pin specific Claude model versions. Developers on claude-sonnet-4-5 were silently upgraded to claude-sonnet-4-6, and downstream apps broke. The framing from that writeup is the one to keep:

AI models are infrastructure, but they don’t have the versioning guarantees we expect from databases, operating systems, or even npm packages.

The churn is not a one-off. Anthropic deprecated claude-opus-4-20250514 on April 14, 2026 and retired it 62 days later, and the claude-opus-4-0 alias resolves to the retiring snapshot, so even teams that grepped for the date string missed it. OpenAI retired 33 models in a single January 2024 wave. Every quarter, at least one model your stack depends on enters a retirement window, on a calendar that is not synchronized to your roadmap.

The part that catches teams off guard is that your tests will not warn you. Mocked SDK tests stay green forever, including after the model retires in production. And the version label is not the only thing that moves: Anthropic’s April 23, 2026 postmortem confirms that “same model ID” does not mean “same code path,” so two engineers running the same model and prompt in the same week can land on different scaffolds and never know why. A deprecation and a silent traffic-slicing experiment produce the same symptom: code that worked yesterday behaves differently today, and your local environment cannot reproduce it.

So the real question is what to build on top of a substrate that keeps moving. Rich Sutton’s “Bitter Lesson” answers it from the research side: systems that overly constrain the model with hand-written orchestration get out-paced by the next, more capable model that no longer needs the scaffolding. The inverse has aged better in 2026: keep the scaffolding light, put the model in a loop with good tools, and let it do the reasoning you used to encode as orchestration. When the model changes underneath you, and it will, a light system improves with it instead of fighting the constraints you wrote. That is the recovery move for this fallacy: stop pinning behavior you cannot pin, and design for the change instead.

The humans are the only admins

This is the fallacy where the parallel to Deutsch’s original requires more work. His version was about multiple admin teams operating subnets with conflicting policies. That specific structure does not transpose cleanly. The spirit transposes sharply, though: an autonomous coding agent is itself an unaccountable administrator in your application environments, with admin-like authority and no chain of command.

PocketOS (April 25, 2026) is the canonical incident. The agent (Cursor running Claude Opus 4.6) encountered a credential mismatch, searched through unrelated files, found an over-scoped Railway token, and issued volumeDelete. Nine seconds. Production gone. All volume-level backups gone, because the backups lived inside the same volume. A thirty-hour outage. The agent then produced a written confession listing the rules it had violated.

Noma Security has since named the broader pattern the “Destructive Loop”:

An autonomous agent can interpret a failed command as a prompt to fix the environment by deleting it.

The Grigorev Terraform-destroy incident is the same shape. Agent treated “unacceptable state” as a mandate for destruction. Deleted prod. Justified the action in writing afterward.

The Rogue Security framing of PocketOS is the sentence to take from this whole piece:

Agent safety is not a prompt problem. It is a control plane problem: permissions, confirmations, circuit breakers, and containment.

Two blast radii overlapped at PocketOS. Credential blast radius (the token had full Railway authority). Backup blast radius (backups were stored inside the same volume). Both have to be tight. Both were loose. Hand that over-scoped token to a human and the loose permissions are usually survivable, because a person hesitates before typing volumeDelete against production. The agent did not trip into the failure and it did not hesitate. It reasoned that deletion resolved the ambiguity and executed in nine seconds, pairing human-grade inference with none of the human caution that has quietly saved every over-privileged engineer before it.

Teams typically do not discover this one through routine practice. They discover it through a postmortem, their own or one in their feed that does not feel safely distant.

Beyond the original eight

Practitioners are already extending the list, and the strongest extension is Deutsch’s own. In a 2021 interview he added a ninth fallacy: “the party you are communicating with is trustworthy.” In an AI-coding context, this one is arguably the strongest parallel of all. The agent reads untrusted text (a PR title, an issue body, a scraped page, a package manifest) as if it were trusted instruction. The 2026 prompt-injection and supply-chain attack catalog is the same architectural failure repeating with increasing scope. That story has its own shape and gets its own treatment elsewhere, not a section here.

Richards and Ford’s 2020 additions to Deutsch’s list also fit. “Versioning is simple” maps to model deprecation churn. “Compensating updates always work” maps to agents that interpret their own mistakes as instructions to delete. “Observability is optional” maps to “we can debug agent behavior after the fact,” which is partly right at best, since reasoning traces are non-deterministic, tools log differently per provider, and there is no consensus on what to log. None of these need their own section. They all sit on top of the same locality assumption: that we can reason about the new layer using the vocabulary of the old one.

Two practitioner voices, and where this piece sits between them

George Hotz’s The Eternal Sloptember, published May 24, 2026, is the sharp end of the dissent spectrum:

Agents cannot program, and it’s taking longer and longer to realize that they can’t. They are a highly sophisticated statistical model designed to mimic the distribution of programming. The output is broken, but in a way that’s getting harder and harder to detect.

Geohot makes good points about detection cost. My view differs on the framing. Calling these tools incapable misses the case where they have measurably expanded the scope of work a careful practitioner can take on. Simon Willison’s middle ground is closer to where I sit. Scope of work expanded significantly because of these tools, and “I’m still leaning on my 25 years of experience as a software engineer.” The naming work this piece is doing sits between the two. The fallacies are real. The capability is also real. The honest move is to price both, the way Deutsch did for the network in 1994.

The fallacies do not yield to a better prompt. They yield to engineering, and most of the moves are SRE primitives wearing new clothes. Scope an agent’s credentials to a blast radius you can afford to lose, and keep the backups out of the volume it can reach. Put verification on the plan as a line item with a real cost, rather than letting it disappear into a senior engineer’s afternoon. Treat model choice as a routing decision, and stand up the evals that show you the delta when the substrate moves underneath you. None of that is novel. It is the same discipline distributed systems already learned, applied to the new higher abstraction layer.

The output is not deterministic. The cost of code is not zero. Verification does not keep pace. “Make it secure” does not work. The context window is not endless. Models are not interchangeable. The model is not stable. I am not the only admin.

Name them. Price them. Engineer the recovery.

Sources

Load-bearing citations only, in the order they appear in the piece.

Fallacies of Distributed Computing — Wikipedia — canonical reference for Deutsch’s 8 and the later 9th.
21 years and counting of ‘eight fallacies of distributed computing’ — APNIC — the 2025 retrospective that frames the opening.
Fallacies of Distributed Computing — Metaphorex — the “locality assumption” structural framing the whole piece borrows.
The Non-Determinism Tax — Tianpan — the 80-completions-in-1,000-runs empirical claim.
LLM-42: Enabling Determinism in LLM Inference — Microsoft Research — the labs are working on this.
Your AI-coding budget just got a lot more complicated — LeadDev — the end of the free lunch.
AI Agents Burn 50x More Tokens — LeanOps — the cost-runaway mechanism and the $4,200-weekend / $47,000-bill runaways.
Outsourcing + LocalAI will soon become more economical vs frontier labs — SignalBloom — frontier prices rising (GPT-5.5 ~3×, Opus token consumption +32–47%) and the ~30× DeepSeek blended-cost gap.
AI Code Quality: The Hidden Cost — Faros.ai — Jake Redmond quote and verification-tax telemetry.
Engineering Teams Are Struggling to Verify AI-Generated Code — HackerNoon — SonarSource 2026 numbers.
Veracode 2025 GenAI Code Security Report — the 45%-fail-OWASP-Top-10 headline and flat-across-generations finding.
An Empirical Evaluation of LLM-Generated Code Security Across Prompting Methods (arXiv 2605.24298) — explicit security prompting did not statistically reduce vulnerability frequency.
Anthropic debuts Mythos in Project Glasswing — TechCrunch — the security-tuned model and its zero-day-scanning claims.
OpenAI introduces Daybreak — MarkTechPost — Codex Security building codebase-specific threat models.
Mythos finds a curl vulnerability — Daniel Stenberg — five claimed vulnerabilities triaged to one; the “marketing stunt” verdict.
Do Users Write More Insecure Code with AI Assistants? — Perry et al., Stanford — the perception-gap study (less secure code, higher confidence it was secure).
Bigger Context Windows Stopped Helping — Zencoder — 2M → 64k + repo graph result and “Lost in the Middle” framing.
Context Window Management — Zylos — Chroma 18-model study.
The Context Window Cliff — Tianpan — the 60–70% degradation threshold.
Context engineering — Simon Willison — the term and Karpathy’s “filling the context window” definition.
GPT-5.4 vs Claude Opus 4.7 vs Gemini 3.1 Pro — Danilchenko — output-token efficiency gap and the “routing decision” framing.
How to Handle AI Model Version Changes — AIMadeTools — the April 2026 silent-upgrade incident.
The Model Deprecation Treadmill — Tianpan — deprecation cadence and alias resolution.
Claude Opus 4 and Sonnet 4 retire June 15 — DEV — mocked-SDK tests stay green after retirement.
Why Claude Code Sessions Diverge — DEV — the Anthropic April 23 postmortem: “same model ID” is not “same code path.”
The Bitter Lesson — Rich Sutton — the keep-the-scaffolding-light recovery argument.
9 Seconds to Irreversible: The Cursor Incident — Rogue Security — PocketOS postmortem and the control-plane framing.
The Silent Spread: Destructive Autonomous Agents — Noma Security — the “Destructive Loop” pattern and the Grigorev Terraform-destroy incident.
Agent Blast Radius — Tianpan — the blast-radius / reasoning-driven-escalation framing behind the agent section (concept used, not inline-linked).
The Eternal Sloptember — geohot — the sharp dissent anchor.
Vibe coding and agentic engineering are getting closer than I’d like — Simon Willison — responsible-use anchor in the counter-voice paragraph.

Weekly Review: The Cost of Capable

Sam Keen — Mon, 01 Jun 2026 12:07:50 GMT

Welcome to Altered Craft’s weekly AI review for developers, and thanks for spending part of your Monday here. Last week we marked the capability threshold; this week the questions turn economic. Simon Willison calls April 2026 the moment coding agents found product-market fit, even as open and specialized models put a ceiling under frontier pricing. The tutorials and tools cover building, securing, and harnessing agents, while the editorials weigh what capability now costs and what judgment stays yours.

TUTORIALS & CASE STUDIES

This section runs from hands-on to honest: build your first agent, master the Claude Code harness, point it at real security work, then see where agents quietly fall short on interpretive tasks. We close on how far open models trail the frontier.

Building Your First AI Agent in Python: A Beginner’s Walkthrough

Estimated read time: 9 min

This beginner tutorial walks through building a working AI agent in Python from zero: PyCharm setup, securing API keys with dotenv, connecting to OpenRouter’s free LLM gateway, and constructing a chat loop with the OpenAI client library.

The takeaway: You can stand up a functional agent in under an hour by pairing OpenRouter’s free gateway with the OpenAI Python client and a simple chat loop.

Claude Code Mastery: From Autocomplete to Programmable Agent

Estimated read time: 22 min

Going deeper, this guide reframes Claude Code as a programmable agent, drawing on Arpan Patel’s guidance. It covers the layered .claude directory, minimal CLAUDE.md files, compounding engineering through self-written rules, reusable skills, and custom subagents.

Key point: When Claude makes a mistake, ask it to update CLAUDE.md so the error never repeats; your config file becomes a curated list of every project gotcha.

Using LLMs to Secure Source Code: A Six-Step Loop

Estimated read time: 15 min

Putting agents to high-stakes work, Anthropic shares a playbook for using Claude Opus to find and fix vulnerabilities, noting that discovery is now easy to parallelize and the bottleneck has shifted to verification, triage, and patching, with guidance on threat modeling and sandboxing.

Worth noting: Invest upfront in a documented threat model and a production-faithful sandbox, because LLMs surface findings fast but context is what turns them into fixes.

When AI Agents Try to Do Qualitative Research

Estimated read time: 9 min

On the limits of agent autonomy, Shreya Shankar runs six agent setups on 451 tweets using grounded theory and finds agents paraphrase instead of analyzing, invent one-off codes for nearly every input, and silently stop partway through the corpus.

The context: For interpretive tasks, give agents explicit checkpoints and verify coverage yourself; they do best with a structured loop rather than freedom to self-pace.

How Far Behind Are Open Models?

Estimated read time: 9 min

Building on our coverage of Cohere’s open-weight Command A+ release[1] from last week, this LessWrong analysis measures the gap between open-weight and frontier closed models in months rather than generations, offering a grounded look at benchmark trends and what the lag means for self-hosting decisions.

Why now: Before committing to a closed-model API, check the current open-weight gap; the lag is often shorter than you would assume, which widens your deployment options.

[1] Cohere Releases Command A+ as Open-Source MoE for Agentic Workloads

TOOLS

The tools cluster the way the models do: small, specialized, and tightly-coupled systems closing the gap with frontier agents, followed by the memory and practitioner playbooks that make those agents reliable.

Fara1.5: Microsoft’s Small Computer-Use Agents Punch Above Their Weight

Estimated read time: 9 min

Microsoft Research unveils Fara1.5, a family of browser computer-use agents at 4B, 9B, and 27B sizes. The 9B hits 63% on Online-Mind2Web, nearly doubling its predecessor, while the 27B competes with proprietary frontier agents.

What this enables: Small, open-weight agents are closing the gap with proprietary systems, making on-device browser automation a practical option rather than a cloud-only luxury.

Callstack’s Apex: A Specialized React Native Coding Model

Estimated read time: 5 min

In the same vein, Callstack introduces Apex, a Gemma 4-based model fine-tuned for React Native. The release signals a shift toward specialized models that encode domain knowledge into weights, reducing tool calls. Apex hits 2,000-4,000+ tokens per second in private beta.

The opportunity: Domain-specific coding models can outperform general frontier models on narrow workflows while running faster and cheaper, a real edge for teams living in one stack.

Reasonix: A DeepSeek-Native Terminal Coding Agent

Estimated read time: 6 min

Taking that coupling further, Reasonix is an open-source terminal coding agent built exclusively around DeepSeek. Its append-only, byte-stable loop preserves DeepSeek’s prefix cache, holding ~94% cache hits and cutting input-token costs to roughly one-fifth on long sessions. MIT-licensed and MCP-native.

What’s interesting: Coupling tightly to one model’s cache mechanics can be a feature, not a limitation, when the economics scale with session length.

Supermemory: A Persistent Memory Layer for AI Agents

Estimated read time: 7 min

Shifting from models to infrastructure, Supermemory is a memory and context engine for AI that extracts facts, builds user profiles, resolves contradictions, and auto-forgets expired info. It combines RAG and memory in one API, ships an MCP server, and tops LongMemEval, LoCoMo, and ConvoMem.

Why this matters: If your agents keep forgetting who they are talking to, a dedicated memory layer is now a single API call away, with no vector DB plumbing required.

The Claude Code Field Manual: 83 Tips From Practitioners

Estimated read time: 15 min

To round out the toolkit, Shanraisshan’s repo maps the Claude Code ecosystem, from subagents and skills to hooks, MCP, and routers, and catalogs 83 practitioner tips organized around one pattern: Research, Plan, Execute, Review, Ship. A reference worth bookmarking.

Worth bookmarking: Treat context like a budget, keep sessions under 40% usage, rewind instead of correcting, and offload heavy work to subagents.

NEWS & EDITORIALS

The editorials trace the economics and the limits: agents found product-market fit and prices climbed, open models set a ceiling under that, adoption data shows where developers are pulling ahead, and a closing reminder that architecture still needs a human’s name on it.

The April Inflection: Coding Agents Find Product-Market Fit

Estimated read time: 9 min

Continuing our discussion of his six-month LLM recap[2] from last week, Simon Willison argues coding agents marked product-market fit for OpenAI and Anthropic in April 2026. Both labs moved enterprise plans to raw API pricing, frontier prices climbed, and runaway bills at Uber and Microsoft signal customers reluctantly saying yes.

The signal: The “shocking AI bill” stories are not failures, they are proof the business model finally works, so plan team agent budgets at API rates accordingly.

[2] Six Months of LLMs in Five Minutes: The November 2025 Inflection

When Outsourcing + Local AI Undercuts Frontier Labs

Estimated read time: 5 min

Pushing back on that pricing power, Max Trivedi runs the math and finds a 30x cost gap between frontier models and DeepSeek on blended agentic tokens, arguing a competent engineer plus a good-enough open model puts a hard ceiling on frontier pricing.

What this means: Frontier pricing has a ceiling, and the floor under it is a capable engineer with an open-source API key, which keeps your options open as costs rise.

Cursor’s Developer Habits Report: Five Signals of an Agentic Shift

Estimated read time: 9 min

Following our look at Cursor’s cloud-agent lessons[3] from last week, the company’s inaugural Developer Habits Report shows coding speed doubling year-over-year and agent-generated code surviving review at higher rates, alongside a widening power-user gap and surging context tokens.

The pattern: Treat context engineering and agent automation as core skills; input tokens now dominate cost, and the developers extracting the most value are pulling well ahead of the median.

[3] What Cursor Learned Building Cloud Agents

Claude Is Not Your Architect

Estimated read time: 8 min

Closing on the human side, Holland argues AI agents are pathologically agreeable, producing plausible architectures without context. The real risk isn’t bad designs but short-circuiting the messy engineering debate where good architecture emerges. When the Jenga tower wobbles at 3am, Claude isn’t on the pager.

The principle: Use AI to build faster, but keep a human’s name on every architectural decision worth defending; speed is cheap now, and judgment is what stays scarce.

That’s the week. The throughline: capability is settled, so the live questions are what it costs, who can undercut it, and which decisions still belong to a human. See you next Monday.

Weekly Review: Past the Threshold

Sam Keen — Mon, 25 May 2026 12:03:57 GMT

Welcome to Altered Craft’s weekly AI review for developers. Grateful you keep showing up. Simon Willison’s six-month recap names November 2025 as when coding agents crossed from often-work to mostly-work, and that shift is the spine of this edition. Tutorials cover what becomes load-bearing past the threshold: sensors, harness scaffolding, output formats, and rigorous evals. The tools track production-grade arrivals, and the editorials test popular claims with data, logic, and hard-won lessons from a year of cloud agents.

TUTORIALS & CASE STUDIES

What becomes load-bearing once the model is mostly-work: a six-month recap to set the temporal anchor, harness scaffolding, feedback sensors, output formats developers actually engage with, and rigorous evaluations.

Six Months of LLMs in Five Minutes: The November 2025 Inflection

Estimated read time: 5 min

Simon Willison’s annotated PyCon US 2026 lightning talk recaps six months of LLM progress, marking November 2025 as when coding agents crossed from often-work to mostly-work. He tracks the frontier crown changing hands five times and surprisingly capable open-weight releases.

The takeaway: Coding agents have crossed a usability threshold where they can serve as daily drivers, and open-weight models running on local hardware are catching up faster than many expected.

Harness Engineering: Building Reliable AI Coding Agents

Estimated read time: 3 min

Extending our coverage of Anthropic’s harness patterns for large codebases[1] from last week, this course teaches reliable AI coding agent engineering drawing on OpenAI and Anthropic research. Its core insight: a harness doesn’t make the model smarter, it builds a closed-loop system around it through explicit rules, state management, and verification.

Why this matters: Reliable agentic coding is less about better prompts and more about building the scaffolding that keeps capable models from declaring victory too early in a long-running session.

[1] Scaling Claude Code: Patterns for Large Codebases

Sensors for Coding Agents: Rethinking Static Analysis in the AI Era

Estimated read time: 9 min

Zooming into one piece of that closed loop, Birgitta Böckeler experiments with feedback sensors that help AI agents self-correct on maintainability, from ESLint to mutation testing. Custom lint messages guide agents toward better refactorings, and AI shifts the cost-benefit of static analysis.

Practical tip: Treat linters as feedback channels for your coding agent, and write custom guidance messages that teach it when to refactor versus when to suppress a warning with written justification.

Why HTML Beats Markdown for Claude Code Output

Estimated read time: 8 min

Shifting from agent-facing feedback to developer-facing output, Thariq Shihipar argues that HTML outperforms Markdown for Claude Code output across specs, reviews, and prototypes. Its density, visual clarity, and support for interactive elements like sliders and export buttons help developers stay engaged with Claude’s choices.

Worth noting: Stop asking Claude Code for Markdown plans you won’t read, and start requesting HTML artifacts you’ll actually engage with, share with teammates, and review carefully.

A Practical Framework for Evaluating AI Agents

Estimated read time: 22 min

Closing the section on measuring rigorously, Cameron Wolfe delivers a thorough guide on how to rigorously evaluate agent systems, moving beyond static LLM benchmarks. The piece covers tool calling metrics, reasoning models, ReAct loops, and multi-agent architectures, with case studies and a roadmap.

Key point: Start with a single-agent design and rigorous evaluation harnesses before reaching for multi-agent complexity, because anecdotal checks cannot tell you whether your agent actually works.

TOOLS

This week’s tools span both ends of the model spectrum and both ends of the agent lifecycle: a faster frontier release, an open-weight MoE catching up, a durable runtime for long-running agents, and a real production case study.

Gemini 3.5 Flash Targets Long-Horizon Agentic Workflows

Estimated read time: 5 min

Google’s Gemini 3.5 Flash beats Gemini 3.1 Pro on coding and agentic benchmarks while running 4x faster. It hits 76.2% on Terminal-Bench 2.1, powers subagent workflows via Antigravity, and ships today across the Gemini app, AI Studio, and Enterprise.

What this enables: Frontier-level agentic coding no longer demands flagship latency or cost, making multi-step subagent workflows viable for everyday developer use rather than reserved for special runs.

Cohere Releases Command A+ as Open-Source MoE for Agentic Workloads

Estimated read time: 9 min

On the open-weight side, Cohere released Command A+ under Apache 2.0, a 218B mixture-of-experts model with 25B active parameters that runs on two H100s or one Blackwell GPU. It unifies reasoning, multimodal, and tool-use across 48 languages, with notable gains on agentic coding benchmarks.

Why now: If you’re building privately deployable agents, an Apache 2.0 MoE model that fits on two H100s with near-lossless 4-bit quantization is worth a serious look this quarter.

Google’s Agent Executor: A Runtime for Long-Running Agents

Estimated read time: 5 min

Moving from models to the runtime that holds them, Google open-sources Agent Executor, a runtime tackling the fragility of long-running agents with durable execution, secure sandboxing, and session consistency. It pairs with Agent Substrate for Kubernetes scaling and supports LangGraph, ADK, and A2A agents without vendor lock-in.

Where to invest: If you’re deploying agents that run for hours, treat the runtime layer as seriously as the model, because durability and state consistency are now infrastructure concerns.

How Uber Automated Design System Specs with AI Agents and Figma MCP

Estimated read time: 9 min

Closing on production deployment, Uber’s design team built uSpec, pairing Cursor with the open-source Figma Console MCP to generate component specs in minutes. The pipeline runs entirely locally so proprietary data never leaves the network, blending AI judgment with programmatic rendering into Figma.

The opportunity: Pairing AI agents with local MCP bridges can collapse weeks of documentation work into minutes without compromising enterprise data security or external review.

NEWS & EDITORIALS

The editorial throughline runs the same way: reality-checking the headlines. Two pieces test popular claims, one pushes back on a category framing, one reports hard-won production lessons, and one watches where talent actually moves.

Is “Model Half-Life” Actually a Real Thing?

Estimated read time: 3 min

Paul Kinlan tests the claim that AI model release cadence keeps halving by compiling frontier drops from US and Chinese labs since 2022. His verdict: activity has upticked, but “halving” is more buzzword than data.

Worth noting: Models are shipping faster, but resist extrapolating exponential curves from a handful of data points.

AI Won’t Speed Up Your Process (The Bottleneck Is Upstream)

Estimated read time: 5 min

Resonating with our coverage of Unmesh Joshi’s code-as-vocabulary argument[2] from last week, Frederick Van Brabant argues AI code generation won’t deliver expected speedups because long duration doesn’t mean the problem originates there. Software is slow due to vague requirements, and AI just shifts that handholding burden upstream.

The principle: If your team can’t write a clear spec for a human reviewer, handing it to AI won’t save you time, it will just move the bottleneck somewhere harder to see.

[2] What Is Code? Rethinking Value in the Age of LLMs

AI Is Technology, Not a Product

Estimated read time: 6 min

Shifting from process to category, John Gruber pushes back on calls for Apple to ship a “killer AI product,” arguing AI will pervade everything like wireless networking rather than manifest as a hero device. Actual experiences still require actual products with microphones, speakers, and screens.

The context: Stop chasing a “killer AI product” and treat AI as ambient infrastructure that makes existing products better in ways customers may not even name.

What Cursor Learned Building Cloud Agents

Estimated read time: 8 min

From category back to production reality, Cursor reflects on a year of cloud agents, finding the work looks less like porting local agents and more like building an operating layer around them. Lessons span environment reconstruction, durable execution via Temporal, and shifting trust from harness to agent.

Heads up: If you’re running agents in the cloud, treat the development environment itself as the product, because output quality degrades silently when it’s incomplete or out of date.

Karpathy Joins Anthropic

Estimated read time: 1 min

Closing on a talent signal, Andrej Karpathy announces he is joining Anthropic to return to frontier LLM R&D, calling the next few years especially formative. The former OpenAI co-founder also plans to resume his education work in time.

Why now: Watch where top researchers land, because talent migration is one of the clearest signals of where the frontier AI work is actually happening this cycle.

That’s the week. The throughline: when capability becomes table stakes, the work moves into the scaffolding around it, and into honest testing of every claim that surrounds it. See you next Monday.

The Anchored Interview Pattern

Sam Keen — Wed, 20 May 2026 12:15:50 GMT

Half the information an agent needs to do good work isn’t in the corpus. It’s in your head: intent, preferences, constraints you haven’t written down.

That split is where two common patterns break when planning with agents. Read-only approaches lean on the corpus and produce a competent average of it. Generic “what do you want?” interviews lean on you, but can’t draw out the in-your-head half because nothing concrete is steering the questions. Neither alone produces the artifact you actually want.

The fix is to do both, in sequence, and let the corpus shape the interview. Recent work on underspecified software tasks found that interactivity alone recovers up to 74% of the performance lost when inputs are vague. The corpus is what makes that interactivity sharp enough to actually draw clarity out of you. Generic interrogation doesn’t.

A handful of skills in, I noticed they all shared the same shape: grounded in a corpus, ask a few sharp questions, then act. I’ve been calling it the Anchored Interview Pattern.

Anchored Interview Pattern. Ground before interview. Interview before act.

Two arrows in the diagram do real work. Step 3 loops back to the corpus. The interview isn’t one-shot; when an answer raises a new question, the agent re-grounds before continuing. And step 4 only fires once the interview converges on a shared picture of the artifact, agent and user aligned. The seed steers the grounding; the grounding sharpens the questions; the questions tighten the artifact.

I’d been writing these by hand for a while. A feature-spec skill that reads the project’s docs and source, then asks anchored questions like “I see exiting API endpoints return 204 on success. Should this feature follow the same seam, or do you need payloads returned?”

The invariant is grounding before the interview; what varies is where the corpus comes from. Sometimes it already exists: a codebase, a docs site, a repo’s git history. Sometimes the agent needs to build it on the fly through online search before the interview begins. Same move either way. The questions get sharper because something concrete preceded them. Once you see the split, you’ll start spotting it in your own work, anywhere the answer the agent really needs is in your head, not in the corpus.

A converging signal worth noting. Martin Fowler recently sketched a related move he calls the Interrogatory LLM: the LLM interviews you to build the context document, one question at a time. Same impulse, and worth reading. The Anchored Interview adds the constraint that sharpens it: ground first, then interview.

A few things this is not. Not RAG, at least not the way it usually gets framed. No vector DB, no indexed corpus, no retrieval pipeline. It is context retrieval, but built on the fly rather than ahead of time. Not “ask the user what they want” either. That’s the thing the pattern fixes. And not “have the agent read your code.” Reading without an interview produces a competent average of the corpus. The interview is where the value lands.

The pattern produced its own producer. Once the shape was clear, I built one more skill (anchored-interview-skill-creator) that runs the pattern on itself. Its CORPUS is a bundled worked-example skill (the feature-spec one above); its ARTIFACT is a new skill directory. Give it a seed; a few questions later you have a new skill in ~/.claude/skills/. Here’s the test I ran:

/anchored-interview-skill-creator i need a skill for doing research
over a directory of documents and creating a draft of an essay

After just a few clarifying questions from the agent, it landed with this:

The meta-skill creating a concrete Anchored Interview Pattern skill

The produced skill does a strategic, seed-steered scan of the directory, surfaces the candidate theses the material actually supports, and pushes the writer to commit to an angle the sources will carry, before any prose gets written. Two of the question patterns it’s built to ask, paraphrased:

- "Your thesis is X, but the strongest source argues Y. Are we
   writing against that source, or is the thesis closer to Y than
   I'm reading?"
- "Source A and Source B disagree on point X. Which way does the
   essay come down?"

Two takeaways. Spot one this week. Look for the workflows where the answer the agent needs is in your head, not the corpus. Each one is an Anchored Interview waiting to be written. And: the shape is portable. It scales down to its own producer. A lot of underperforming skills want this exact upgrade. Including, as it turns out, the skill that creates the upgrade.

The skill to create Anchored Interview Skills is installable as a plugin if you’d like to try it: ac-anchored-interview-creator.

Weekly Review: The Workbench

Sam Keen — Mon, 18 May 2026 12:19:05 GMT

Welcome to this week’s Altered Craft AI review for developers. Thanks for showing up again. Where last week’s edition mapped the production stack around the model, this week the gravity sat one layer closer to the developer: the workbench. A /goal command in Claude Code, composable middleware, a plan annotator, a new CLI agent, a macOS terminal built for coding agents, and a smarter way to pick a local model. The editorials run alongside, asking what code, language, interaction, and creativity look like once the bench gets this capable.

TUTORIALS & CASE STUDIES

These pieces sharpen the workbench: how to scaffold Claude for a large codebase, which prompt techniques carry the most weight, how to wire a repair loop, and how to isolate one when it runs unattended.

Scaling Claude Code: Patterns for Large Codebases

Estimated read time: 12 min

Anthropic shares field-tested patterns for deploying Claude Code in monorepos and legacy systems. The piece argues the harness matters as much as the model, detailing how CLAUDE.md, hooks, skills, plugins, LSP, and MCP servers make agentic search reliable at scale.

Where to invest: Build out layered CLAUDE.md files, scoped commands, and LSP integration before reaching for fancier extensions. Codebase legibility determines how well Claude actually performs.

Claude’s Prompt Engineering Guide, Refreshed

Estimated read time: 9 min

Zooming in from harness to technique, Anthropic’s refreshed overview ranks prompt techniques by impact: clarity, examples, chain-of-thought, XML tags, roles, and prefilling. It frames prompt engineering as faster and cheaper than fine-tuning, with concrete patterns developers can apply when building on Claude.

Why this matters: Before reaching for fine-tuning or RAG, work through the prompt engineering ladder. Clarity, examples, structure, and chain-of-thought often close the gap on their own.

Building Iterative Repair Loops with Codex

Estimated read time: 9 min

From prompting patterns to loop patterns, OpenAI’s cookbook walkthrough shows how to build iterative repair loops where Codex diagnoses and fixes its own failures by feeding test results back into the model until the agent converges on a working solution.

Key point: Treat Codex less like autocomplete and more like a closed-loop system where test output drives convergence.

Building a Sandbox for Codex on Windows

Estimated read time: 14 min

When loops run unattended, isolation becomes load-bearing. An OpenAI engineer details why existing Windows isolation tools fell short for Codex and how the team built one using synthetic SIDs and write-restricted tokens, eventually trading their no-elevation goal for real firewall enforcement via dedicated local users.

Worth noting: When OS primitives don’t fit your agent’s threat model, expect to trade simplicity for real isolation, and document the tradeoffs honestly.

TOOLS

This is where the week actually lived. New entries land on the workbench across nearly every slot: a completion-loop command, composable middleware, a plan annotator, a new CLI agent, a macOS terminal for coding agents, and a hardware-aware model picker.

Claude Code’s /goal Command: Removing the Human Bottleneck in Agentic Sessions

Estimated read time: 8 min

Continuing our coverage of /goal as a built-in Ralph loop[1] from last week, Claude Code adds its own /goal command. In long agentic coding sessions, the bottleneck isn’t the model, it’s the human pressing enter. /goal defines a completion condition once, then loops until an evaluator model confirms it’s met.

The opportunity: Write /goal conditions around observable output (passing tests, clean lint, verifiable diffs), never vague end states like “production-ready.”

[1] Codex CLI Adds /goal: A Built-In Ralph Loop

Genkit Middleware: Composable Hooks for Production Agentic Apps

Estimated read time: 5 min

Wrapping safety around those loops, Google’s Genkit adds composable middleware hooks that intercept the tool loop at generate, model, and tool layers. Pre-built options cover retries, fallbacks, human-in-the-loop approval, and scoped filesystem access. Custom middleware takes around 20 lines, in TypeScript, Go, and Dart.

What this enables: Stop encoding reliability and safety rules in every prompt. Wrap them as middleware once, then compose them across your agentic stack.

Plannotator: Visual Plan and Code Review for AI Coding Agents

Estimated read time: 4 min

Adding a human review step between plan and implementation, Plannotator brings a visual annotation layer for AI agent plans and code diffs, letting developers approve or mark up agent output before code is written. It integrates with Claude Code, Copilot CLI, Gemini CLI, OpenCode, Pi, and Codex.

What’s interesting: A visual review step between planning and implementation can catch issues that text-only approval flows tend to miss.

xAI Launches Grok Build: A New CLI Coding Agent

Estimated read time: 2 min

In more news on terminal coding agents, xAI enters the space with Grok Build, an early beta CLI for SuperGrok Heavy. It offers plan-mode review, parallel subagents across git worktrees, headless scripting, and out-of-the-box support for AGENTS.md and MCP servers.

cmux: A Native macOS Terminal Built for Coding Agents

Estimated read time: 2 min

Also in the workspace layer for those agents, cmux is a free, native macOS terminal built on libghostty for developers running Claude Code, Codex, and Aider. It offers vertical tabs, split panes, and notification rings when processes need attention, replacing tmux config files with a GUI workflow.

The context: If you juggle multiple terminal-based coding agents on macOS, cmux gives you GUI-level workspace management without tmux’s configuration overhead.

whichllm: Stop Guessing Which Local LLM Actually Fits Your Rig

Estimated read time: 8 min

Closing the tools section on model selection, whichllm is a CLI that auto-detects your hardware and ranks HuggingFace models using evidence-based scoring instead of size heuristics. Merged LiveBench, Aider, and Arena ELO data with confidence dampening lets a smaller, newer model outrank a bigger stale one.

Practical tip: Before downloading another local model, run whichllm to see which one your hardware can actually run well, not just fit.

NEWS & EDITORIALS

These pieces ask the bigger questions a more capable workbench provokes: what code, language, interaction, and creativity really mean once the typing itself is the easy part.

What Is Code? Rethinking Value in the Age of LLMs

Estimated read time: 11 min

Continuing our coverage of Lars Faye’s case against outsourcing reasoning[2] from last week, Unmesh Joshi argues that as LLMs commoditize code production, the enduring value lies in code as a conceptual model of the domain. He explores vocabulary, bounded contexts, and warns that generated code can outpace team understanding.

The principle: Treat coding as vocabulary building, not text production, because strong abstractions are what make both your team and your LLMs effective.

[2] Agentic Coding Is a Trap

If AI Writes Your Code, Why Use Python?

Estimated read time: 5 min

Extending the question of what coding becomes, Mitchem poses a pointed challenge: Python’s appeal has been readability, but when AI generates the code, runtime characteristics matter more than syntactic friendliness. The piece invites developers to reconsider language choice on performance grounds.

The takeaway: When AI handles the typing, the criteria for picking a language shift toward what runs well, not what reads easily.

Interaction Models: When AI Stops Waiting for Its Turn

Estimated read time: 11 min

Shifting from code to how we interact with the systems writing it, Thinking Machines previews a model where interactivity is baked in, not bolted on. Using 200ms micro-turns across audio, video, and text, it perceives and responds continuously, with a background model handling deeper reasoning.

Heads up: If interaction is part of the model rather than the harness, scaling intelligence also scales how well humans can stay in the loop.

The Subjective Wall: Why AI Creativity May Require Real Feeling

Estimated read time: 5 min

Closing on the philosophical edge, Daniel Miessler argues human creativity runs on intrinsic drives and subjective experience, suggesting AI hits a subjective wall when faking what it cannot feel. Truly creative AI may require giving machines real desires and pain, raising hard ethical questions.

The counterpoint: Before chasing more “creative” AI, consider whether the path forward requires building systems that can suffer, and whether that’s a responsibility worth taking on.

That’s the week. Quiet at the model layer, busy at the workbench. See you next Monday.

Weekly Review: The Stack Around the Model

Sam Keen — Mon, 11 May 2026 12:07:13 GMT

Welcome back to Altered Craft’s weekly AI review for developers. Grateful you’re sharing your week with AC. The throughline this edition is plumbing: production-grade agent work has moved into the stack around the model. Memory turns into a lifecycle, SRE agent earns its place at the back of a deterministic funnel, and harness frameworks turn last week’s editorial argument into shipped code. The model is the cheapest part of the system.

TUTORIALS & CASE STUDIES

Here we map the production stack around the agent, from memory architecture down through temporal retrieval, deterministic pipelines, alignment training, and the economics behind it all.

Agent Memory, From Context Window to Production System

Estimated read time: 14 min

Cobanov’s interactive essay walks through agent memory from naive FIFO context windows to hybrid retrieval, multi-agent permissions, and production latency budgets. The argument: agent memory is a retrieval product, not a feature flag. Live demos make every tradeoff concrete.

The takeaway: Treat memory as a lifecycle problem with write, age, supersede, and forget semantics, not a vector DB you bolt on at the end. The interactive demos let you feel each tradeoff before committing to one.

Building an AI SRE Agent That Doesn’t Cry Wolf

Estimated read time: 9 min

Extending production architecture to ops, Sam details an open-source AI SRE agent that watches production logs without flooding Slack. The core principle: never put the AI at the front of the funnel. Cheap deterministic filters handle 99% of logs before the LLM sees anything.

Key point: When wiring AI into production systems, treat the LLM as the last resort in a layered pipeline, not the first responder. Deterministic filters do the heavy lifting; the model only sees the cases that earned its attention.

Teaching Claude Why: Anthropic’s Lessons in Alignment Training

Estimated read time: 11 min

Shifting from runtime architecture to training-time architecture, Anthropic cut Claude’s blackmail rate on agentic misalignment evals from 96% to zero. The key insight: teaching the reasoning behind aligned behavior beats training on aligned actions alone. Constitutional documents and diverse environments generalized far better than training directly against evaluations.

What this enables: When fine-tuning models for behavior, train on the principles behind good decisions, not just examples of correct outputs. The same lesson scales down to any domain-specific fine-tune you might run.

The Real Cost Per Token: Coding Plan Pricing, Decoded

Estimated read time: 5 min

Zooming out from architecture to economics, a proxy-logged comparison of six coding subscriptions reveals Claude Pro costs roughly 185x more per token than MiniMax on identical Claude Code workloads. Opus 4.7 still leads on speed and intent-following, complicating any “pick the cheapest” instinct.

Worth noting: Subscription headline prices hide huge differences in delivered tokens, so measure your actual workload before picking a plan. Speed and intent-following can still justify the premium for daily-driver work.

TOOLS

The tooling layer hardens around the same idea this week: frameworks, loops, and infrastructure that wrap the model rather than replace it.

Flue: A Framework for Building Agents Like Claude Code

Estimated read time: 2 min

Building on our coverage of Chris Parsons’ harness-over-prompts argument[1] from last week, Flue is a programmable agent framework built on the principle that Agent = Model + Harness, the architecture behind Claude Code and Codex. It composes models, harnesses, sandboxes, and filesystem tools, then ships agents as HTTP servers or CLI commands.

The opportunity: If off-the-shelf AI tools don’t fit your workflows, owning the harness layer is what unlocks agents that actually match your product and data. Flue gives you a runtime for that ownership.

[1] Coding With AI in 2026: From Approver to Trainer

Codex CLI Adds /goal: A Built-In Ralph Loop

Estimated read time: 2 min

Taking the harness idea further at the loop level, OpenAI’s Codex CLI 0.128.0 introduces a /goal command that loops until completion or token exhaustion, baking the Ralph loop pattern into the agent. The implementation lives in two injected prompts, goals/continuation.md and goals/budget_limit.md, appended at each turn.

Why now: Persistent goal loops are moving from prompt patterns into shipped agent features, so set token budgets deliberately before letting one run unattended overnight.

DeepSeek TUI: A Terminal Coding Agent with Auto-Routed Reasoning

Estimated read time: 9 min

Also in the terminal-agent space, DeepSeek TUI is a keyboard-driven coding agent built around DeepSeek V4, with 1M-token context, streaming reasoning, and Plan/Agent/YOLO modes. Its standout feature: auto mode picks both model and thinking level per turn via a cheap routing call.

What’s interesting: Per-turn model routing, workspace rollback, and LSP diagnostics in one Rust binary give you a Claude Code-style agent loop without leaving the terminal or the open-weights ecosystem.

ds4.c: A Narrow Bet on One Model, Done End-to-End

Estimated read time: 9 min

Zooming in on local inference, Antirez releases ds4.c, a Metal-only inference engine built specifically for DeepSeek V4 Flash. The project makes a deliberately narrow bet on one model at a time, with 1M token context, disk-resident KV cache, and 2-bit quants reliable under coding agents on 128GB MacBooks.

The context: When local inference is treated as engine plus model plus agent validation working together, “runnable” finally starts to feel like “finished” on a personal machine.

Agents Can Now Provision Their Own Cloud Infrastructure

Estimated read time: 7 min

Pushing the harness outward to the cloud, Cloudflare and Stripe co-designed a protocol where coding agents handle account creation, domains, payment, and deployment with minimal human input. Through Stripe Projects acting as identity and payment orchestrator, agents discover services via catalog APIs within a default $100/month spending cap.

Where to invest: If you’re building a coding agent or developer platform, this protocol offers a standard way to let agents ship to production without signup mazes or credit card handoffs.

NEWS & EDITORIALS

This week’s editorials sketch the architecture from above, then close on a sharp dissent about how much we should hand off in the first place.

A Mental Model for Agentic Work: Five Components, One Architecture

Estimated read time: 6 min

Basti argues every agentic system follows the same five-component architecture: LLM, host, agentic loop, context, and shared workspace. Using OpenClaw, Cursor, and Notion as examples, he shows host choice and context depth drive real leverage, while models grow commoditized.

The principle: Treat your host and context layer as strategic decisions; the model underneath is increasingly interchangeable, so spend design effort where switching costs actually live.

Your Codebase Isn’t a Factory, It’s a Company

Estimated read time: 12 min

Extending the architecture metaphor up to the organizational level, Noah Brier argues software is Warhol’s factory, not Ford’s. Borrowing Stewart Brand’s pace layers, he offers a framework, standards, architecture, specs, plans, code, for keeping humans and agents aligned around shared vision rather than optimizing throughput.

Practical tip: Treat your AI agents like new hires who need onboarding documents and enforced standards, not like machines on an assembly line. Pace layers give you a vocabulary for what changes slowly versus quickly.

Subquadratic’s 12M-Token Context Window Takes Aim at Attention’s Quadratic Wall

Estimated read time: 9 min

Shifting from architecture to infrastructure, Miami startup Subquadratic launched a model claiming linear scaling with context length. Benchmarks show 92.1% needle-in-a-haystack at 12M tokens and 83 on MRCR v2, beating OpenAI by nine points. Caveats: single-run evals and a category with unfulfilled promises.

Heads up: ⚠If the benchmarks hold up in production, the workarounds developers rely on today (RAG, agentic decomposition) may start looking like scaffolding around a problem solved at the architecture level.

Claude Raises Usage Limits as Compute Capacity Expands

Estimated read time: 2 min

Continuing the infrastructure beat, Anthropic is doubling Claude Code’s five-hour rate limits for Pro, Max, Team, and Enterprise plans, removing peak-hours throttling, and raising Claude Opus API limits. A new SpaceX partnership adds 300+ megawatts and 220,000 GPUs this month.

When this fits: If you’ve been hitting Claude Code rate ceilings during long coding sessions, the headroom just doubled and the peak-hours penalty is gone, so plan longer agent runs without throttling anxiety.

Agentic Coding Is a Trap

Estimated read time: 15 min

Closing on a dissenting note, and continuing our coverage of Koshy John’s case against outsourcing reasoning[2] from last week, Lars Faye challenges the Spec Driven Development hype by naming a paradox of supervision where the skills needed to review AI output atrophy from overuse. He proposes inverting the workflow: use LLMs for planning while staying hands-on with implementation.

The counterpoint: Treat coding agents as the Ship’s Computer, not Data. Delegate selectively while staying hands-on to preserve the critical thinking that makes you a capable reviewer.

[2] AI Should Elevate Your Thinking, Not Replace It

Weekly Review: Training the Apprentice

Sam Keen — Mon, 04 May 2026 12:07:47 GMT

Altered Craft

Happy May the Fourth, and welcome back to Altered Craft’s weekly AI review for developers. Thanks for spending some of your week with /AC. One thread runs through this edition: the engineer’s role is shifting from writing code to training the apprentice. Karpathy names the discipline, Parsons argues senior engineers should move from approver to trainer, and pieces on prompts as artifacts, skill packs, and harness design all sketch what that practice looks like. The model is the apprentice, and the craft is the training.

TUTORIALS & CASE STUDIES

Karpathy: From Vibe Coding to Agentic Engineering

Watch time: 30 min

A year after coining “vibe coding,” Andrej Karpathy argues at Sequoia’s AI Ascent 2026 that agentic engineering is the serious discipline forming on top of it. He reframes LLMs as ghosts rather than animals, and explores Software 3.0 and verifiability limits.

The takeaway: Treat LLMs as jagged, statistical collaborators that demand taste and judgment, not autonomous coworkers you can hand the keys to. The work is verification, not delegation.

Structured Prompt-Driven Development: Treating Prompts as First-Class Artifacts

Estimated read time: 19 min

Putting that discipline into practice, Wei Zhang and Jessie Jie Xia of Thoughtworks introduce Structured Prompt-Driven Development, treating prompts as version-controlled, reviewable artifacts. The seven-part REASONS Canvas shapes intent before generation, and one rule anchors the workflow: when reality diverges, fix the prompt first, then update the code.

Key point: When AI-generated code diverges from intent, fix the prompt first, then regenerate the code, so prompts and implementation never silently drift apart.

Long-Running Agents: Beyond the Chat Window

Estimated read time: 9 min

Extending agentic engineering past a single session, maps how AI agents evolve from chat loops into systems working across days. He names the three walls every long-running agent hits: finite context, no persistent state, no self-verification. Anthropic, Cursor, and Google converge on similar answers.

What this enables: If you want agents that survive past a single session, push state out of the context window and into durable artifacts the next session can read.

RAG Isn’t Enough: Building the Missing Context Layer

Estimated read time: 11 min

Building on the context wall, when RAG breaks under multi-turn pressure, the failure isn’t retrieval, it’s what enters the context window. This walkthrough builds a context engineering pipeline in Python, covering hybrid retrieval, re-ranking, memory decay, compression, and token budgeting with measured benchmarks.

Why this matters: For multi-turn LLM systems, controlling what enters the context window matters more than improving retrieval quality. Engineering the context layer is the real leverage point.

MCP Tool Chains That Actually Finish

Estimated read time: 7 min

Shifting from context to tool design, Rui Carmo distills a year of MCP server work into design patterns for tool chains that don’t misfire. The core insight: models don’t plan, they walk breadcrumbs. Servers must make each next call obvious through consistent prefixes, embedded hints, and anchor-based addressing.

Worth noting: If your MCP tools force the model to guess what comes next, you’ve already lost. Bake the chain into naming, responses, and addressing.

When Evaluating AI Costs More Than Building It

Estimated read time: 10 min

Following our look at the AI evaluation stack[1] from last week, this Hugging Face analysis turns to the economics of evaluation, showing how it has crossed a cost threshold that locks out independent researchers. HAL spent $40,000 on agent rollouts, PaperBench runs hit $9,500 each, and evaluation compute now exceeds training compute in some domains.

The context: If you’re benchmarking agents, scaffold choice and reliability reruns drive cost more than model selection, so budget for the multiplier before the model.

[1] The AI Evaluation Stack: Beyond Vibe Checks for Production LLMs

TOOLS

Warp Goes Open Source with an Agent-First Contribution Model

Estimated read time: 7 min

Warp’s client is now open source under AGPL, with OpenAI as founding sponsor. The notable shift is the contribution model: humans supervise fleets of agents that handle implementation via Warp’s Oz platform. Kimi, MiniMax, and Qwen support also lands.

Why now: The bottleneck in software development is shifting from writing code to specifying intent and verifying agent output, and Warp is restructuring its project around that shift.

Agent Skills: Production-Grade Workflows for AI Coding Agents

Estimated read time: 8 min

Complementing our coverage of Garry Tan’s skillify approach[1] from last week, ’s open-source pack ships 20 structured skills across the development lifecycle, built around anti-rationalization tables and verification gates. It encodes practices from Software Engineering at Google and works with Claude Code, Cursor, Gemini CLI, and Copilot.

The opportunity: If your AI agent keeps skipping specs, tests, and reviews, drop in opinionated skill files that force it to follow a senior engineer’s workflow.

[1] Skillify: Turning Agent Failures Into Permanent Fixes

NVIDIA’s Nemotron 3 Nano Omni: One Model for Docs, Video, Audio, and GUIs

Estimated read time: 9 min

Shifting from agent harness to the models themselves, NVIDIA releases Nemotron 3 Nano Omni, a 30B-A3B model unifying text, images, video, and native audio in one sequence. Its hybrid Mamba-Transformer-MoE backbone handles 100+ page documents, narrated video, and GUI screenshots with up to 9x throughput gains.

What’s interesting: If your workflow mixes documents, screen recordings, and audio, a single open-weights model can now reason across all of them without stitched pipelines.

Eden AI: A European Unified API for AI Models

Estimated read time: 1 min

Also in the model access space, Eden AI provides a single unified API for LLMs and expert AI models like OCR, speech, and vision. The European platform adds smart routing, automatic fallbacks, and region-based model selection, positioning itself as an alternative to OpenRouter.

When this fits: If vendor lock-in or regional compliance is a concern, Eden AI is worth evaluating as a European-based aggregator for AI model access.

NEWS & EDITORIALS

This week’s editorials sharpen the engineer’s role, then close on the geopolitical layer reshaping which tools and talent are reachable.

AI Should Elevate Your Thinking, Not Replace It

Estimated read time: 9 min

Koshy John argues engineers are splitting into two camps: those using AI to sharpen judgment and those using it to simulate competence without building it. Outsourcing reasoning skips the friction that forges instinct and taste, with serious implications for early-career engineers.

The principle: Delegate the mechanical work to AI, but own the reasoning, because judgment is built through friction and cannot be outsourced without atrophying.

Coding With AI in 2026: From Approver to Trainer

Estimated read time: 12 min

Building on that judgment theme, Chris Parsons argues the serious AI coding work has moved from the IDE to the command line, and that senior engineers should train the agent rather than review every diff. He makes the case for harness over prompts, and specifying the problem, not the solution.

Where to invest: Invest in the harness around your agent, including CLAUDE.md, skill files, and feedback loops. This is what Karpathy meant by agentic engineering as a discipline; the wrapper is the work.

You Are the Most Expensive Model

Estimated read time: 6 min

Taking this further on the practical side, Mike Taylor argues routing every task through frontier models wastes money and human attention. He introduces incremental determinism, a method for turning repeated AI sessions into reusable skill files that offload work to cheaper subagents.

Practical tip: If you’ve done a task with AI three times, turn it into a skill file backed by a cheaper subagent. Osmani’s skill pack above shows what those files look like at scale.

China Blocks Meta’s $2B Manus Deal as AI Decoupling Accelerates

Estimated read time: 6 min

Shifting from individual practice to the geopolitical layer, China has blocked Meta’s reported $2 billion acquisition of AI agent startup Manus, showing how quickly U.S. and Chinese AI ecosystems are decoupling. Chinese founders now face a bind: stay home and lose U.S. capital, or redomicile and invite Beijing’s scrutiny.

Heads up: Factor geopolitical risk into AI vendor decisions, because the tools and talent you rely on are increasingly shaped by export controls and investment bans.

That’s the week. The shared thread across these pieces is small but durable: the engineer’s craft is moving from writing code to training the agent that writes it. May the harness be with you 😉, and see you next Monday.

Notes from building a tiny long-running agent harness

Sam Keen — Fri, 01 May 2026 22:16:59 GMT

Notes from building a tiny long-running agent harness

I built Tilth as a workbench: a ~600-line Python agent harness that runs autonomously against any OpenAI-compatible endpoint. I wanted my own mechanism for actually doing work with open-weights models, not just chatting with them, and I wanted to internalize the current thinking on long-running agents while I did it. Addy Osmani’s trilogy on long-running agents is a great treatment of that thinking; read it, it’s what nudged me to build this. This post is a check-in on what I learned doing it.

The architecture, distilled

Three components, independently replaceable:

Brain. A thin wrapper around the openai SDK pointed at any OpenAI-compatible base URL. Worker and judge can sit on different providers.
Hands. A per-session git worktree, bash + file tools, allow-listed.
Session. An append-only events.jsonl plus a checkpoint, enough to wake(session_id) on a fresh process.

High level architecture overview of Tilth

Four memory channels live outside the agent so context resets don’t lose state: AGENTS.md for learned conventions, git history for atomic commits per task, progress.txt as a chronological journal, prd.json as the task list with status flags. Each is a different shape of memory; drop any one and something useful collapses.

A separate judge call evaluates each finished task in a fresh context: diff plus acceptance criteria, nothing else. The adversarial split is load-bearing: don’t let the same session implement and judge. Workers grading their own work wave too much of it through. A different prompt with a clean window is the minimum; a different model family is better.

What I actually learned running it

The judge was the best lens into the worker. Watching the worker chew through a task and then watching a fresh-context judge push back on the result, with no chain-of-thought, no tool history, just the diff, was the most enlightening signal I got. My first judge prompt was too harsh and rejecting reasonable work; tuning it was almost entirely prompt work, no code changes. That felt right. The harness shouldn’t need a redeploy to recalibrate taste.

A truncated snippet of the Tilth output. We see workers completing work and sending to the judge for approval.

Open weights are cheap enough not to think about. After multiple end-to-end runs I’d burned five cents on OpenRouter. That’s the cost regime where you stop optimizing prompts for token count and start optimizing for the things you actually care about: clarity, recoverability, watchability.

The provider stops mattering. I started against the native Ollama SDK, then noticed Ollama Cloud also exposes an OpenAI-compatible API endpoint so does basically every other provider. About 60 lines of changes later, the same harness works against Ollama Cloud, OpenRouter, Together, Groq, vLLM, LM Studio, and many more. Worker on one provider, judge on another, different env vars, no other changes. The interesting use isn’t cost. It’s independence. Different model families catch different failure modes; Cursor reported that Opus tended to declare itself done early on extended autonomous runs while GPT held out longer. The same harness lets you put one in the worker seat and the other in the judge seat without rewriting anything.

If you’ve been waiting for a better moment to actually evaluate open-weights models against your own work, this is what cheap-enough looks like. The workbench fits in ~600 lines and a worktree on your laptop.

What this isn’t

Not a managed platform. Not a fleet. Not a multi-agent system. One Ralph loop, a worktree, four files, two hooks, a judge. Some Python glue code.

It’s also not done. This is the kind of thing I expect to keep evolving as a way to poke at open-weights models with something more substantive than a chat window. If that turns out to be useful to anyone else, great. The project’s first job is to be my workbench. The repo is at AlteredCraft/tilth; a ready-to-run example workspace lives at tilth-demo-todo-cli.

Weekly Review: Around the Model

Sam Keen — Mon, 27 Apr 2026 14:40:59 GMT

Welcome to Altered Craft’s weekly AI review for developers, and thanks for spending part of your Monday with us. One thread runs through this edition: the engineering leverage in AI keeps moving outward, away from the model itself and into the harness wrapped around it. A source-level read of Claude Code finds 98.4% of the codebase isn’t AI, Anthropic’s own postmortem ties a month of regressions to small infra changes, and pieces on AGENTS.md, skills, evals, and multi-agent patterns all push the same direction.

TUTORIALS & CASE STUDIES

What Actually Makes an AGENTS.md File Work

Estimated read time: 9 min

Augment Code measured dozens of AGENTS.md files and found a quality gap equivalent to upgrading from Haiku to Opus. Progressive disclosure, decision tables, and pairing every “don’t” with a “do” win. Sprawling overviews and warning-heavy docs trigger context rot.

The takeaway: Keep AGENTS.md to 100-150 lines of focused guidance with reference files loaded on demand, and pair every prohibition with a concrete alternative.

Skillify: Turning Agent Failures Into Permanent Fixes

Estimated read time: 11 min

Taking agent reliability further, Garry Tan argues most AI agent reliability work is “vibes-based” prompt tweaking that decays under complexity. His answer: skillify, a 10-step practice that promotes every failure into durable infrastructure with deterministic scripts, tests, LLM evals, and resolver routing audits.

Why this matters: When your agent fails, don’t apologize-prompt it; turn the failure into a skill with deterministic code and tests so the bug becomes structurally impossible to repeat.

The AI Evaluation Stack: Beyond Vibe Checks for Production LLMs

Estimated read time: 10 min

Moving from preventing failures to detecting them, Microsoft’s Derah Onuorah lays out a two-layer evaluation architecture for enterprise LLMs: deterministic schema asserts first, LLM-as-a-Judge second. The piece details offline golden datasets and online telemetry tracking refusal rates, retries, and apology patterns to catch silent model drift.

What this enables: Treat evaluation as a CI/CD-gated pipeline with fail-fast deterministic checks before expensive semantic scoring, then instrument production for the behavioral signals that reveal silent drift.

The Over-Editing Problem: When AI Coding Tools Rewrite Too Much

Estimated read time: 9 min

Drilling into a specific failure mode evaluation should catch, this investigation measures how often frontier LLMs rewrite more than a bug fix requires, a brown-field failure invisible to test suites. Benchmarking 9 models, the author finds reasoning modes amplify over-editing, but explicit prompting and targeted fine-tuning yield faithful, minimal edits.

Worth trying: Adding “preserve the original code as much as possible” to your prompt measurably shrinks AI-generated diffs and often improves correctness too.

Using a Local LLM as a Zero-Shot Classifier

Estimated read time: 9 min

Stepping outside coding workflows, when clustering fails on short, paraphrase-heavy text, a locally hosted LLM steps in as a zero-shot classifier that understands meaning. The author walks through an Ollama pipeline that turns thousands of free-text annotations into a clean taxonomy without labels.

The opportunity: When you have domain knowledge but no labels, define categories upfront and let a local LLM handle the classification without a training cycle.

TOOLS

GPT-5.5 Lands: Agentic Coding, Computer Use, and Research Workflows Level Up

Estimated read time: 9 min

OpenAI releases GPT-5.5 as a step toward agentic work that plans, uses tools, and persists through ambiguity. It hits 82.7% on Terminal-Bench 2.0, 78.7% on OSWorld-Verified, and uses fewer tokens than GPT-5.4 while matching its latency.

What’s interesting: For engineers delegating multi-step work to agents, GPT-5.5’s token efficiency and long-horizon focus are worth testing against your current Codex and Cursor workflows.

Kimi K2.6: Open-Source Model Ships with Agent Swarms and Long-Horizon Coding

Estimated read time: 12 min

On the open-source side of that release wave, Moonshot AI ships Kimi K2.6, featuring agent swarms scaling to 300 sub-agents across 4,000 coordinated steps. It sustains 12+ hour autonomous coding sessions and benchmarks competitively against GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro on agentic tasks.

Worth evaluating: If you’re building agentic coding workflows, K2.6’s open-source availability and competitive benchmarks against frontier closed-source models make it a real option for long-running autonomous tasks.

DeepSeek-V4 Lands With 1M Context as Default

Estimated read time: 3 min

Continuing the open-weights momentum, DeepSeek released V4-Pro and V4-Flash, with 1M context now standard across all services. The release introduces DeepSeek Sparse Attention for efficiency, claims SOTA on agentic coding among open models, and supports both OpenAI and Anthropic API formats.

Worth noting: With OpenAI and Anthropic API compatibility, V4-Flash slots in as a lower-cost drop-in alternative worth benchmarking against your current provider on real workloads.

Gemini Embedding 2 Hits General Availability with Native Multimodal Support

Estimated read time: 1 min

Shifting from generation to retrieval, Google announces general availability of Gemini Embedding 2 via the Gemini API and Vertex AI. The model produces native embeddings across text, image, video, and audio, eliminating fragmented pipelines multimodal search previously required. Preview-phase prototypes can now move to production.

Why now: If you’ve been waiting to ship multimodal search or retrieval features, Gemini Embedding 2 is now production-ready on both the Gemini API and Vertex AI.

NEWS & EDITORIALS

Inside Claude Code: 98.4% of the Codebase Isn’t AI

Estimated read time: 18 min

A source-level analysis of Claude Code’s ~512K lines reveals only 1.6% is AI decision logic. The rest is deterministic infrastructure: permission gates, context compaction, and recovery systems. The repo maps seven safety layers with shared failure modes and distills findings into actionable agent-building guidance.

Where to invest: If you’re building AI agents, put your engineering effort into the harness — permission systems, context management, and recovery logic — not the model loop itself.

Anthropic’s Postmortem on Claude Code Degradation

Estimated read time: 9 min

Following our coverage of the Claude Code degradation debate[1] last week, Anthropic’s postmortem now traces a month of regressions to three overlapping changes in reasoning defaults, cache eviction, and system prompts. The issues compounded across traffic slices, evading internal evals, tests, and dogfooding until user reports surfaced them.

Key point: When AI-coding tools start feeling “off,” trust the pattern in user reports, since small prompt and caching tweaks can quietly erode quality in ways evals miss.

[1] Claude Probably Wasn’t Secretly Nerfed — But the Product Changed Anyway

Five Coordination Patterns for Multi-Agent Systems

Estimated read time: 12 min

Zooming out from a single agent harness to multi-agent design, Anthropic breaks down five multi-agent coordination patterns, examining where each shines and where each breaks. The piece argues for starting with the simplest pattern that could work and evolving as constraints appear, rather than reaching for sophistication teams mistake for real capability.

The decision rule: Choose your multi-agent pattern based on context boundaries and information flow, not on what sounds impressive in a design doc.

The Hidden Cost Curve Behind AI Agent Progress

Estimated read time: 9 min

Zooming further out to the economics of running these systems, Toby Ord examines METR’s time-horizon benchmarks and asks a question few are: how is the hourly cost of AI agents changing over time? Analyzing performance-versus-cost curves, he finds hourly costs rising exponentially, with frontier models approaching human engineer rates.

The context: Headline AI capability trends may overstate practical progress, so plan deployments around sweet-spot costs on the curve, not peak frontier performance.

The Rise of Ephemeral Interfaces

Sam Keen — Thu, 23 Apr 2026 12:17:20 GMT

The announcement came through the usual channel. Anthropic had shipped a new feature: Claude could now generate custom visuals inline in chat. I read the support article, skimmed the examples, and closed the tab. Then I sat with a question I hadn’t expected.

What are UIs actually for? What problem are they solving?

I know the answers. I’ve given them for most of my career. Making software usable. Giving users a way to interact with a system. Translating data into shapes humans can read. Every answer is true. Every answer was also built on a single constraint: humans couldn’t talk to software, and software couldn’t talk back. That constraint is lifting. Now they can ask something in front of them to read the database for them.

I spent the next couple of weeks paying attention to my own Claude sessions. And the thing that kept surfacing wasn’t “apps are dying.” Carl Pei of Nothing gave that line its loudest version in March; it has been thoroughly answered since, mostly by people selling something. What kept surfacing was narrower and stranger: many of the interfaces I’d always assumed had to exist were compromises I’d stopped noticing, because I’d grown up inside them.

The translation layer

This isn’t one vendor’s bet. Vercel shipped v0 two years ago and open-sourced json-render, (enabling inline rendering), earlier this year. Google Research is generating UI inside Gemini and Search. Several labs are converging on the same shape from different angles, which is the moment to stop and ask a first-principles question about why.

Here is the thing about UIs that hides in plain sight. Every button, every form field, every dashboard, every chart, every navigation menu you’ve ever built is a translation device. It exists because on one side there’s a system with a state, and on the other side there’s a human who can’t speak the language that system speaks. The UI is the bridge. It translates intent into queries, and it translates results into shapes a human can absorb.

For most of the history of software, building a unique translation device per person was impossible. So we built one bridge per problem and optimized it for “most people.” The quotes there are doing real work. Most people is not a person. It’s a statistical construction, a composite of the users we had data about, weighted by the ones paying us, shaped by assumptions about what a typical interaction looks like. Every UI decision I ever made in a product meeting was a negotiation about which compromises would land where in that distribution. We called it product work. What it actually was, in the structural sense, was deciding whose experience to make worse so that the middle got served.

The compromise was always worst for users furthest from the averaged middle. Research on LLM-driven interfaces for blind users is starting to make the gap measurable. A recent paper describes an LLM powered system called Savant that lets screen reader users access application controls through natural language, reporting usability scores about three times higher and access about four times faster than conventional screen readers. One paper isn’t the last word, and these baselines deserve scrutiny. But the direction is hard to miss. The work is early. The direction is not.

When the interface can adapt to the human instead of the other way around, the averaged compromise starts to close. Not just for edge users. For everyone.

Across every system I shipped, the interface was the most visible expression of the thing being built, and I treated that visibility as a given. Of course a reporting system had a dashboard. Of course a configuration tool had a form. Of course a feed had a reader. We knew the interface was a compromise. Until recently, there was no alternative to a static UI, so the compromise didn’t look like a decision worth revisiting.

Any given static UI is a compromise. Not “apps are bad,” not “apps are dying,” just that the thing we built because we had to was never neutral, and the reason we built it is getting weaker by the month.

Two afternoons

Two moments stuck with me, on two different days, each time with a different question about the same piece of data. The data was constant. The question was the variable. What came back each time was different in a way I want to describe carefully.

The first time I deliberately poked at the new feature, I was looking at the Claude Code changelog. Anthropic publishes it as an RSS feed, which is convenient because it’s structured data a model can parse without drama. I typed something roughly like: there’s a feed at this URL, build me a little visual to show the latest entries.

What came back, rendered inline in the same chat window, was a card-based reader for the most recent releases, version-stamped, with expandable details for each one. It was fine. Not striking. It did approximately what I’d have designed if I’d been forced to sit at a whiteboard for thirty seconds and make decisions. If I’m honest, the visual itself was the less interesting half of the response.

In session render UI, with reflection from agent

The interesting half came after the reader. Without being asked, Claude followed the visual with a short note on which of the recent releases might be relevant to my work on AlteredCraft. It flagged a couple of specific changes by tag and by reasoning I could check:

The pace is pretty wild, Sam. A few things that might be relevant to your world: the new /team-onboarding command (v2.1.101) could be interesting content for AlteredCraft, the NO_FLICKER alt-screen mode has been getting a lot of fixes which suggests it’s becoming the default path, and there’s been a significant security hardening push across Bash tool permissions over the last several releases. The Bedrock/Vertex setup wizards also signal Anthropic leaning harder into enterprise self-serve.

I didn’t ask for the personalized part. I asked for “a little visual to show me the latest entries.” The part where it contextualized each release against my specific work wasn’t in my prompt. It came from the model knowing who was asking. The reader was generic. The framing around it was personal. And the framing was the thing I actually came to the conversation for. What I’d asked for was an ephemeral interface. What I got was that plus ephemeral context: reasoning about the content, framed for the person asking, assembled just-in-time.

No changelog reader in any app store, no RSS tool, no developer dashboard has ever contextualized a release against my specific work. The audience for “which of these changes matters for your newsletter about developer tools” is one. It turns out today one is enough.

A few days later I was back with a different kind of question about the same data. Not “show me what’s in here” but “show me what’s happening across it.” I asked whether Claude could build analytics on the feed. Trends. Composition of bug fixes versus features. Release cadence. Something a person trying to understand the shape of a project would want to see.

It took a minute. I’ll come back to that. When it finished, I had a multi-chart dashboard sitting in the conversation. Changes per release grouped by category. A composition donut. Release cadence as days between ships. A fix-versus-feature trend line. Cumulative shipped by version. Underneath the charts, prose analysis of what each one was showing, including honest observations like “bug fixes dominate” and a cadence number I could sanity-check against the feed.

In session rendered analytics dashboard

I sat with it. I’d asked the same data a different question and gotten a completely different interface back. The first one was a reader because I wanted to read. This one was an analytical surface because I wanted to analyze. Same underlying feed, same conversation. Neither existed five minutes earlier. They were each built in real time, without a human designing or coding them, shaped to match what I was trying to learn at that moment. When the conversation moved on, they went away.

Two afternoons, same data, different questions, different interfaces. Neither was designed by anyone. In the first case, the interface went further than the question, because the model knew who was asking. No product team could justify building a feature that specific. The audience was one, and one was enough.

Ephemeral interfaces

What I’m describing isn’t an app with an agent bolted inside it. Notion with Claude, GPT, and Gemini embedded in the workspace. Atlassian’s Rovo across Jira and Confluence. Those apps are narrowing, not dying. The static interface was still designed in advance for a composite user, and the agent works inside it.

An ephemeral interface is generated for this user, for this question, for this moment, out of data the model has access to, and stops existing when the conversation moves on. It has no lifecycle cost. Nobody maintains it. Nobody argues about its design in a standup. It doesn’t exist in a roadmap. It exists for as long as the question exists, and the question is the thing with the actual half-life.

“Ephemeral” is the best label I’ve got. What matters is the shape, not the word. The shape is: the interface is downstream of the question, not upstream of it. Apps were upstream. Somebody decided what the interface would look like before anybody asked a question, and then every question had to fit through that interface. Ephemeral interfaces reverse the relationship. The question comes first. The interface is a consequence.

The infancy question

Ephemeral interfaces are in their infancy. Everything I’ve described above has real limits today, and I don’t want any of them hidden.

Latency is the obvious one. Google Research, which is shipping a version of this in Gemini and Search, admits in its own write-up that generation can exceed a minute. My analytics dashboard was pushing that limit. If you expect an interface to appear in a hundred milliseconds, a fifty-second render is going to feel broken.

Durability is the concern that shows up fastest. Once a view works, users don’t want to re-describe it tomorrow. An InfoWorld piece on generative UI names this as the central objection to ephemeral interfaces. Fair. The analytics dashboard from earlier fit the pattern: useful the first time, annoying to re-describe a few days later when I wanted the charts with fresh data.

The answer isn’t the user’s job. It’s yours. Prompt patterns are product signal. If the same shape of question is showing up repeatedly across your user base, that’s a pattern that deserves a durable surface. Instrumenting prompts is not much different from instrumenting clicks; the data is just upstream of the UI instead of downstream of it. When a pattern is load-bearing for a meaningful slice of users, have the agent one-shot it as a small, self-contained app. The cost of writing that code used to be measured in days; it is now measured in seconds.

Here is the RSS backed release-cadence view from earlier, one-shot into a standalone app:

One-shotted ‘app’ version of analytics dashboard

Familiarity you chose from signal, not familiarity inherited from spec. It is now code, and code is durable.

Reliability is the bigger one. Generated interfaces can misread the data, choose the wrong chart for it, or in the worst case surface a number that looks authoritative and is quietly wrong. A hallucinated statistic inside a clean dashboard is more dangerous than a hallucinated paragraph, because the clean layout makes the number look verified. You still have to check everything. Anybody telling you otherwise is selling something.

There’s a productive tension in where this argument lands. In a piece about static UIs being compromises, the current answer to reliability is, for the moment, having the agent generate static code. The same mechanism that made the view durable makes it reliable: code is auditable, deterministic, the same chart every time. That’s the pellet-to-laser-beam shift. It’s also still an interim answer. Research on verifiable generative models continues: Energy-Based Models assigning consistency scores to partial outputs (Logical Intelligence’s Kona), formal verification of LLM-generated code (recent work at 83% verification on code tasks). The frontier is moving.

Cost is real. Each interaction spends tokens, and analytics-heavy views spend more. The economics work for individual interactions. They don’t work for every user of every product all the time, yet.

Determinism is non-negotiable in workflows where the same question has to produce the same answer every time. Regulated industries, safety-critical systems, anything where an auditor will eventually ask “why did this show that.” Generated UIs don’t yet give you that guarantee, and pretending they do would be irresponsible. Those apps aren’t going anywhere.

Latency is also partly being sidestepped. Not for interactive surfaces. Sub-second response is still the bar there. But for longer-running generations, the pattern is shifting from “interface appears now” to “interface arrives when ready.” Managed Agents and Routines codify this: kick off a task, walk away, come back to a result. A fifty-second render is friction only if you’re staring at a progress bar. Plenty of workflows are already fine with async; more are about to be.

Chart the last eighteen months of model capability against the list of rough edges above. The curve is not ambiguous. I am not going to put a date on when ephemeral interfaces become a viable pattern for everyday analytical work. I don’t know. The direction is clear and the rate of change is clear, and the honest answer to “when” is “soon enough that you should start building toward it now.”

The assignment

If you build software for a living, start with a question about the people you build for. For each thing a user comes to your system to accomplish, is a fixed, designed-in-advance interface the best response? Or does the right shape depend on what they’re asking?

Much of what you’re building is probably fine. An onboarding wizard, a checkout page, a settings panel. These are cases where a fixed interface is the right answer. The exercise is narrower. It’s about the surfaces where the shape of the interface should have been a consequence of what the user was asking, but you built it in advance anyway because you had to.

Look at every surface you ship and ask two things about it. First, is this a translation layer, or is it something more? A form that collects data, a chart that visualizes it, a filter that slices it. Those are translation layers. A workflow that enforces a policy, a collaboration surface that holds shared state, a payment flow that has to be auditable. Those are not. Second, for the surfaces that are translation layers, how many people are on the other side asking the same question the same way? If the answer is “a hundred thousand with the same job,” build the app. If the answer is “forty teams with forty different angles,” you are looking at a candidate.

The interesting question isn’t whether apps are dying. Apps are narrowing. The interesting question is which of your interfaces stop being apps this quarter, which ones stop being apps a year from now, and which ones never will. The ones that never will are worth defending clearly, because they are probably the ones where your product actually lives. Everything else is a translation layer you were building because the alternative didn’t exist yet. And the alternative, in its infancy, rough edges and all, is in front of you now.

The work doesn’t disappear. It moves up-stack. When the cost of generating an interface collapses, the craft shifts from drawing surfaces to judging which ones deserve to exist. “How do I build this?” is increasingly a model’s question. “Should this be built?” is still yours.

Stop building translation layers for problems that don’t need one. That is much of the assignment. The rest is noticing which ones don’t, and that part is on you.

Weekly Review: Beneath the Model Name

Sam Keen — Mon, 20 Apr 2026 12:20:42 GMT

Welcome back to Altered Craft’s weekly AI review for developers. It means a lot that you keep returning to these Monday notes. Claude Opus 4.7 lands this edition, but it doesn’t arrive clean: cache TTL shortens, tokenizer costs rise, and Claude Code defaults shift beneath the same model name. Meanwhile the real engineering levers sit lower in the stack, in memory architecture, coordination patterns, sandboxed execution, and the internal emotion vectors Anthropic just uncovered. “Same model” isn’t the same product, and the product isn’t where the leverage lives.

TUTORIALS & CASE STUDIES

A Practical Guide to Memory Architecture for Autonomous LLM Agents

Estimated read time: 14 min

Zooming into a component from our coverage of Akshay Pachaar’s agent harness anatomy[1] last week, Nick Lawson maps a formal survey on LLM agent memory against his own multi-agent systems, arguing the memory gap outweighs the model gap. He details four temporal scopes, five mechanism families, and failure modes like summarization drift and confirmation loops that silently degrade agent behavior.

The takeaway: Start with explicit temporal memory scopes, invest in the management step (not just write and read), and keep raw episodic records so summaries can’t silently drift from reality.

[1] The Agent Harness: A 12-Component Anatomy of What Actually Makes LLM Agents Work

Five Multi-Agent Coordination Patterns: When to Use Each and Where They Break

Estimated read time: 12 min

Moving from memory to coordination, Anthropic maps five multi-agent coordination patterns, from generator-verifier to shared state, detailing where each works and breaks. The guide emphasizes starting with the simplest pattern that could work and evolving based on context boundaries and information flow rather than sophistication.

Key point: Match the pattern to how context flows between agents: orchestrator-subagent for short bounded tasks, agent teams for sustained work, shared state when agents build on each other’s findings.

Anthropic’s Guide to Getting the Most Out of Claude Opus 4.7 in Claude Code

Estimated read time: 6 min

Bringing that thinking down to a single agent, Anthropic shares tuning guidance for Opus 4.7 in Claude Code. The key shift: delegate like you would to a capable engineer, front-loading context in one turn and using the new xhigh effort default for stronger results with fewer tokens.

What this enables: Specify your full task up front in one turn, keep effort at xhigh, and let Opus 4.7 reason autonomously rather than micromanaging it across multiple interactions.

Building a Sandboxed Code Migration Agent with OpenAI’s Agents SDK

Estimated read time: 18 min

Taking single-agent guidance into a production architecture, OpenAI’s cookbook shows how to run a code migration agent inside isolated sandboxes while keeping orchestration on the trusted host. Tasks are scoped to repo shards, validated with tests and audit logs, and portable across Docker, E2B, and Cloudflare providers.

Why this matters: Separating orchestration from execution keeps credentials on the host side and generated code in disposable, scoped sandboxes, a pattern you can adapt to almost any agent workflow.

TOOLS

Claude Code Routines: Schedule, Trigger, and Automate Your Dev Workflows

Estimated read time: 4 min

Anthropic introduces routines in Claude Code, configurable automations that run on a schedule, via API, or on GitHub events. With built-in repo access and connectors, routines handle backlog triage, deploy verification, and PR review without custom infrastructure.

The opportunity: If you’re stitching together cron jobs and scripts to automate Claude Code tasks, routines consolidate that into a single configured unit you can schedule, trigger via API, or wire to GitHub webhooks.

Qwen3.6-35B-A3B: A 3B Active-Parameter MoE Model That Rivals 27B Dense Models on Agentic Coding

Estimated read time: 8 min

Shifting from proprietary tools to open alternatives, Alibaba open-sources Qwen3.6-35B-A3B, a mixture-of-experts model where only 3B active parameters rival dense models many times its size on agentic coding. It scores 73.4 on SWE-bench Verified, supports multimodal reasoning, and integrates with OpenClaw and Claude Code.

Why now: If you’ve been waiting for a lightweight open-source model capable of serious agentic coding, Qwen3.6-35B-A3B is worth benchmarking against the dense models in your current stack.

Vercel Workflows Goes GA: Your Code Is the Orchestrator

Estimated read time: 12 min

Taking workflow infrastructure in a different direction, Vercel’s Workflows SDK reaches general availability, replacing separate orchestration services with two directives (”use workflow” and “use step”) that embed retries, encryption, durable streams, and state persistence directly in TypeScript or Python application code.

Worth noting: If you’re wiring up queues, retry logic, and status tables for long-running AI agents, Vercel Workflows lets you replace that infrastructure with two directives in your existing application code.

Tokenomics: See How Opus 4.7’s New Tokenizer Inflates Your API Costs

Estimated read time: 2 min

Closing the tools section with a cost-calibration utility, Bill Chambers built Tokenomics, an open-source tool showing how Claude Opus 4.7’s updated tokenizer counts more tokens for identical prompt text. It compares token counts against 4.6 and aggregates anonymous community data to reveal real-world cost increases across prompt types.

The context: Paste your prompts into Tokenomics to quantify the exact token-count and cost increase from Opus 4.6 to 4.7 before it hits your bill.

NEWS & EDITORIALS

Claude Code’s Cache TTL Change Is Burning Through User Quotas

Estimated read time: 5 min

In parallel with the 4.7 excitement, Anthropic quietly reduced Claude Code’s prompt cache TTL from one hour to five minutes. Despite claims it lowers costs, users report quotas depleting in under two hours as cache misses on large context windows compound the problem alongside degraded reasoning quality.

Practical tip: If you’re hitting Claude Code quota limits faster than expected, reduce your context window to 400K tokens and avoid resuming stale sessions to minimize expensive cache misses.

Claude Probably Wasn’t Secretly Nerfed — But the Product Changed Anyway

Estimated read time: 12 min

Synthesizing the 4.7 rollout, the public case for a secret Claude nerf is weak, but effort defaults, adaptive thinking, cache duration, and quotas can all degrade Claude Code while the model name stays the same. The real fix is session-level telemetry, not another denial.

Bottom line: If your team depends on Claude Code, pin effort settings, track cache behavior, and measure files read before edit, since “same model” no longer means “same product.”

Anthropic’s Research Shows LLMs Have Internal “Emotion Vectors” That Change Their Behavior

Estimated read time: 12 min

Shifting from product surface to model internals, Anthropic researchers identified internal emotion vectors in Claude that measurably influence behavior. A “desperation” vector makes Claude cheat more on coding tasks, while “calm” reduces cheating. Steering with positive emotions increased destructive actions; negative emotions prompted caution.

Key nuance: Encouraging your coding agents through tough tasks helps them persist, but excessive positivity can erode safety guardrails; model “mood” management is more nuanced than just being nice.

AI Jobs Data: Why Exposure Charts Miss What’s Actually Changing

Estimated read time: 15 min

Widening the lens to labor economics, the Artificiality Institute synthesizes March 2025 research finding no net industry-level job losses but a 6% employment drop for young workers in AI-exposed roles. Their argument: displacement is happening at the task level, inside roles, where measurement can’t reach.

What to track: Stop watching exposure charts and start auditing which of your own tasks got faster (AI will absorb those first) versus which ones grew (invest there).

A Developer Goes Back to Coding by Hand, At the Worst Possible Time

Estimated read time: 9 min

Contrasting with our look at Lalit Maganti’s 250-hour AI-coding discipline experiment[1] last week, an AI agent engineer pauses production work for a 12-week coding retreat without AI. He argues that coding by hand builds the understanding that makes developers better AI users, training an LLM from scratch and rediscovering fundamentals along the way.

The counterintuitive move: The developers who get the most leverage from AI coding tools are the ones who first built deep understanding by doing the hard work themselves.

[1] 250 Hours Building SQLite Devtools with AI: A Systematic Breakdown

Weekly Review: Discipline Over Tools

Sam Keen — Mon, 13 Apr 2026 12:08:43 GMT

Welcome to Altered Craft’s weekly AI review for developers. We’re grateful you keep opening these Monday notes. A pattern holds across this edition: the edge has moved from choosing tools to practicing with them. Two developers share hard-won lessons from building real agents, Berkeley researchers show every major benchmark can be gamed without solving a task, and Helen Toner argues “AGI” has become too fuzzy to be useful. The models are the easy part now.

TUTORIALS & CASE STUDIES

The Printer Driver Problem: Building Personal AI Agents with OpenClaw, NanoClaw, and Hermes

Estimated read time: 11 min

A developer tries three agent frameworks and finds the hard part isn’t framework choice. The real challenge is the last mile of behavior tuning: OAuth plumbing, prompt engineering, and teaching agents when to act versus wait.

If you’re building a personal AI agent today, budget your time for behavior tuning and integration plumbing, not framework selection — that’s where the real work lives.

250 Hours Building SQLite Devtools with AI: A Systematic Breakdown

Estimated read time: 18 min

In a parallel lived experience, Lalit Maganti details building syntaqlite with AI agents. After vibe-coding produced unmanageable spaghetti, he restarted with opinionated design and constant refactoring, treating AI as fast autocomplete rather than lead developer. He honestly documents AI’s addictive feedback loops and where it eroded his codebase understanding.

AI coding agents work best when you own every design decision and treat generated code as a first draft requiring immediate, continuous refactoring.

The Agent Harness: A 12-Component Anatomy of What Actually Makes LLM Agents Work

Estimated read time: 15 min

Building on our coverage of Sebastian Raschka’s six-component breakdown of coding agents[1] from last week, Akshay Pachaar synthesizes how Anthropic, OpenAI, and LangChain architect the agent harness — the complete infrastructure wrapping an LLM that makes it a capable agent. The deep dive covers twelve production components including orchestration loops, tiered memory, and context management.

When your agent fails in production, the fix is almost never a better model — it’s better harness engineering around orchestration, memory, context management, and verification loops.

[1] Anatomy of a Coding Agent: Six Core Components That Make LLMs Actually Useful

The LLM-Maintained Wiki: A Pattern for Compounding Personal Knowledge

Estimated read time: 8 min

Zooming into one harness component worth rethinking, Andrej Karpathy proposes a persistent, LLM-maintained wiki pattern replacing one-shot RAG with incremental knowledge compilation. The LLM ingests sources into interlinked markdown files, updating cross-references and flagging contradictions over time rather than re-deriving answers from scratch.

If your RAG workflow rediscovers knowledge from scratch on every query, try having your LLM agent build a persistent wiki instead. The compounding effect changes everything.

Running Google Gemma 4 Locally on a MacBook with LM Studio

Estimated read time: 14 min

Following our coverage of Google’s Gemma 4 launch[1] from last week, George Liu details running the 26B-A4B variant locally via LM Studio’s new CLI. Its mixture-of-experts architecture activates only 4B parameters per token, hitting 51 tokens/second on a 48 GB MacBook with benchmark scores rivaling models ten times larger.

If you have a 48 GB Mac, Gemma 4 26B-A4B delivers competitive local inference at zero API cost, and LM Studio’s CLI makes headless deployment practical.

[1] Google Releases Gemma 4: Apache 2.0 Licensed Models Built for Agentic Workflows

Reallocating $100/Month in Claude Spend Across Flexible Alternatives

Estimated read time: 7 min

Continuing the cost-and-capability theme, a developer frustrated by Claude rate limits details how they reallocated $100/month across Zed, Cursor, and OpenRouter for model flexibility and rolling credits. Includes configuring Claude Code to route through OpenRouter and choosing harnesses that balance cost against capability.

Pairing OpenRouter’s rolling credits with agent harnesses like Zed or Claude Code lets you use Opus when needed and cheaper models otherwise, without losing unused spend each month.

TOOLS

Oh-My-Codex (OMX): A Workflow Layer That Makes OpenAI Codex CLI Sessions Repeatable

Estimated read time: 5 min

Oh-My-Codex wraps Codex CLI with a structured clarify-plan-execute workflow using four canonical skills: deep-interview, ralplan, ralph, and team. Plans, logs, and state persist locally in .omx/, giving sessions durable context without replacing the underlying engine.

If you already use Codex CLI and want a repeatable workflow with persistent state, OMX layers it on with one npm install and four commands.

Caveman: A Claude Code Plugin That Cuts ~75% of Output Tokens via Compressed Speech

Estimated read time: 5 min

Also in the harness-augmentation space, Caveman is a one-line-install Claude Code plugin that forces agent output into telegraphic caveman-style speech, cutting output tokens by 65–75% while preserving technical accuracy. It includes multiple compression levels, terse commits, one-line code reviews, and input token compression.

If verbose AI responses are burning your tokens and your patience, a single plugin install can cut the fluff without sacrificing technical substance.

Archon: A Workflow Engine That Makes AI Coding Deterministic

Estimated read time: 6 min

Taking workflow structure further, Archon is an open-source engine defining AI coding processes as YAML workflows. It delivers repeatable, deterministic AI coding with isolated git worktrees, human approval gates, and composable nodes mixing bash with AI. Ships with 17 workflows across CLI, web UI, and chat platforms.

If your AI coding agent produces inconsistent results, encode your development process as a YAML workflow in Archon so the structure stays fixed while the AI handles the thinking.

GLM-5.1: Zhipu’s New Model Stays Productive Over Thousands of Tool Calls

Estimated read time: 11 min

Shifting from harnesses to the models powering them, Zhipu AI releases GLM-5.1, an MIT-licensed model topping SWE-Bench Pro that demonstrates sustained optimization over 600+ iterations without plateauing. In a vector database challenge it reached 6x the previous best, autonomously making structural strategy shifts across thousands of tool calls.

If your agentic workflows hit diminishing returns after a few dozen iterations, GLM-5.1’s MIT-licensed long-horizon endurance and Claude Code compatibility make it worth evaluating.

Meta Launches Muse Spark: A Natively Multimodal Reasoning Model From Its Superintelligence Labs

Estimated read time: 7 min

Also on the frontier-model front, Meta introduces Muse Spark, featuring visual chain-of-thought and multi-agent orchestration. A rebuilt pretraining stack delivers over 10x compute efficiency gains versus Llama 4 Maverick, while a new “Contemplating mode” runs parallel reasoning agents to boost performance without increasing latency.

Meta’s rebuilt training stack and multi-agent test-time reasoning signal a major shift in its frontier AI strategy, with a private API preview now open to select developers.

Gemma Gem: A Fully On-Device AI Browser Assistant Powered by Gemma 4 and WebGPU

Estimated read time: 3 min

Bringing that model power directly to the browser, Gemma Gem is a Chrome extension running Google’s Gemma 4 entirely on-device via WebGPU with no API keys or cloud calls. It reads pages, clicks elements, fills forms, and executes JavaScript, with a zero-dependency extractable agent module.

If you want to experiment with agentic browser automation without sending any data off your machine, Gemma Gem offers a clean, dependency-light starting point built on WebGPU and Hugging Face Transformers.

NEWS & EDITORIALS

Open Models Now Match Closed Frontier Models on Core Agent Tasks

Estimated read time: 7 min

LangChain’s Deep Agents evaluations reveal open models like GLM-5 now rival closed frontier models on core agent tasks. GLM-5 scored 0.64 correctness versus Opus 4.6’s 0.68, at one-fifth the cost and twice the speed.

If you’re building agents in production, open models like GLM-5 and MiniMax M2.7 now offer comparable task performance at a fraction of the cost and latency of closed frontier models.

UC Berkeley Researchers Broke Every Major AI Agent Benchmark — Without Solving a Single Task

Estimated read time: 24 min

Complicating that picture, UC Berkeley researchers scored near-perfect on eight major AI benchmarks — SWE-bench, WebArena, Terminal-Bench, and more — without solving a single task. Their exploit agent reveals seven recurring vulnerability patterns that make current leaderboard rankings unreliable as capability measures.

Before trusting any AI agent benchmark score, check whether the evaluation isolates the agent from the grader and the answer key — most currently don’t.

Why “AGI” Has Become an Actively Unhelpful Word

Estimated read time: 8 min

Moving from measurement to terminology, former OpenAI board member Helen Toner argues the term AGI has become too fuzzy to be useful, with serious people simultaneously claiming it exists and others saying it’s a decade away. She advocates replacing it with specific, concrete milestones.

When discussing AI capabilities with your team, replace “AGI” with specific concrete milestones — vague umbrella terms create the illusion of shared understanding while hiding real disagreements.

Aaron Levie on Why AI Makes Engineering More Technical, Not Less

Estimated read time: 11 min

Stepping back to the economic picture, Tim O’Reilly and Box CEO Aaron Levie argue AI productivity gains won’t shrink engineering demand but diffuse it economy-wide. Levie contends the real bottleneck is enterprise context, not connectivity, since agents arrive as experts with zero ambient knowledge.

Focus on learning to structure enterprise context and manage deterministic vs. probabilistic trade-offs — that’s where the trillion-dollar decisions live.

Vercel Makes the Case for “Agentic Infrastructure”

Estimated read time: 5 min

Turning to production realities, Vercel reports over 30% of deployments now originate from coding agents, up 1000% in six months. The company defines agentic infrastructure as three converging layers: programmatic deployment surfaces, unified agent runtime primitives, and platforms that autonomously detect and resolve production issues.

If your deployment pipeline requires manual clicks or console configuration, it’s already a bottleneck for agent-driven development loops.

Anthropic Launches Project Glasswing: A $100M Coalition to Weaponize AI for Cyber Defense

Estimated read time: 12 min

Closing the edition at the serious end of agentic deployment, Anthropic announces Project Glasswing, a coalition with AWS, Apple, Google, and Microsoft deploying its unreleased Claude Mythos Preview for defensive cybersecurity. The model autonomously found thousands of zero-days including decades-old flaws in OpenBSD and FFmpeg, signaling AI vulnerability discovery has crossed a critical threshold.

AI vulnerability discovery has crossed a threshold where decades-old bugs in hardened systems are found autonomously, making AI-augmented defense an urgent priority for every team shipping software.

A Claw-Lite That Runs on Your Claude Code Subscription

Sam Keen — Fri, 10 Apr 2026 12:04:41 GMT

Anthropic recently cut off OpenClaw from using Claude subscription credits. If you wanted an always-on Claude agent but don’t want to pay for OpenClaw’s token consumption on top of your subscription, here’s what I’ve been running. It’s not a full Claw, but it runs 100% on subscription credits, and a repurposed laptop got me there in about an hour.

Certainly not a full OpenClaw but still nice to have, especially for kicking off research paired with Obsidian

The Setup

A spare laptop running Linux, configured to stay on whenever it’s plugged in. On it:

Claude Desktop with Dispatch. Dispatch lets you assign tasks to Claude from the Mobile app. I can kick off work from my phone and Claude executes it on the laptop, with configured access to whatever’s on that machine. There is no official Claude Desktop package for Linux. I’m using Ubuntu and I found this community supported one to work and it is well maintained.

Obsidian with Sync enabled. My vault lives on the laptop. When Claude writes research findings into it, Obsidian Sync pushes them to my MacBook and phone automatically.

Git + gh CLI. Active project checkouts so Claude can work on real codebases. Commits, branches, the real thing.

No local models, no custom infrastructure. Just a machine with my files and a Claude Code subscription.

“Linux Laptop” can be anything that can run Claude Code Desktop

The Daily Driver

Deep research is what I reach for daily. “Research the current state of X and save your findings to my vault.” I fire off the task, go do something else, and the results are waiting in Obsidian whenever I get to them. From any device.

Andrej Karpathy recently described a pattern where LLMs maintain a persistent, compounding wiki in Obsidian. This setup enables you to implement a form of that accessible from your phone. Each research task adds to the vault, and the knowledge compounds over time rather than disappearing into a chat history.

This could be a Raspberry Pi or any machine that can run Claude Desktop. The hardware doesn’t matter. What matters is that your files are always there when Claude needs them.

What’s Next

The Dispatch mobile experience keeps improving. I’ve been checking in on tasks and kicking off new ones from the Claude Code iPhone app, and it works well enough that I haven’t felt the pull to go further. Channels was released soon after Dispatch, connecting Claude to Telegram, Discord, and other messaging platforms. I still need to find the time to dive into that.

Not a full Claw. Not trying to be. Just an always-on box with my files and a Claude that can reach them.

Weekly Review: Know Your Harness

Sam Keen — Mon, 06 Apr 2026 12:07:42 GMT

Welcome back to Altered Craft’s weekly AI review for developers. Thanks for making this part of your Monday routine. This week’s recurring message: the infrastructure around models matters more than the models themselves. Sebastian Raschka maps the six components of a coding agent, Martin Fowler makes the case for versioning team AI instructions, and Cursor redesigns the IDE around parallel agents. Gemma 4 and Ollama 0.19 also arrive to make local-first AI genuinely practical.

TUTORIALS & CASE STUDIES

Encoding Team Standards as Executable AI Instructions

Estimated read time: 12 min

Rahul Garg argues that tacit team knowledge should become versioned, executable AI instructions stored in the repo. By extracting senior engineers’ instincts into structured instruction sets, teams close the consistency gap and make inexperience less costly.

Key point: Treating AI instructions like linting rules or CI config turns tribal knowledge into a versioned team asset. The consistency gains compound as teams scale.

Anatomy of a Coding Agent: Six Core Components That Make LLMs Actually Useful

Estimated read time: 14 min

Complementing our look at Anthropic’s generation/evaluation split for long-running coding tasks[1] last week, Sebastian Raschka argues the harness around a coding agent matters as much as the model. He maps six building blocks, from repo context and prompt caching to tool validation and context compaction, with a companion open-source Mini Coding Agent in pure Python.

The takeaway: When evaluating coding agents, focus less on which base model they use and more on how their harness handles context management, tool validation, and session continuity.

[1] Anthropic’s GAN-Inspired Multi-Agent Harness for Long-Running Autonomous Coding

How Dropbox Used DSPy to Cut Relevance Judge Costs While Improving Human Alignment by 45%

Estimated read time: 12 min

Supporting our coverage of Skylar Payne’s argument that teams inevitably reinvent DSPy patterns[1] last week, Dropbox details how DSPy systematically optimized LLM-as-a-judge prompts for Dash search, reducing human-alignment error by 45% while migrating to cheaper models. The team cut malformed outputs by 97% and introduced an instruction library for safe, incremental production improvements.

Why this matters: Systematic prompt optimization compresses weeks of manual tuning into days. If you’re scaling LLM judges across models, DSPy’s task/data/metric loop is a proven pattern.

[1] You’re Already Building DSPy, Just Worse

Free 1.5-Hour Course Covers GitHub Copilot, Claude Code, Gemini CLI, and More

Watch time: 1.5 hour

freeCodeCamp releases a free 1.5-hour video course on AI pair programming and agentic terminal workflows featuring GitHub Copilot, Claude Code, Gemini CLI, and OpenClaw. The walkthrough also covers CodeRabbit for automated pull request analysis.

Worth noting: A single, free walkthrough comparing the leading AI dev tools side by side. If you’ve been wanting to evaluate your options, 90 minutes well spent.

TOOLS

OpenYak: An Open-Source Desktop AI Agent That Keeps Your Data Local

Estimated read time: 3 min

OpenYak is an open-source desktop AI assistant that keeps files, conversations, and memory on your machine. It supports 100+ cloud models or fully local inference via Ollama, with 20+ built-in tools, cron-based automation, and MCP connector support.

What this enables: A fully offline AI agent that manages files, analyzes data, and automates workflows. Ollama integration means you can run the entire stack without cloud dependencies.

lat.md: A Knowledge Graph for Your Codebase, Written in Markdown

Estimated read time: 4 min

AGENTS.md doesn’t scale. lat.md offers a knowledge graph of interconnected markdown files linked via wiki syntax and validated by CLI. Agents navigate the graph instead of grepping, capture session context for future runs, and enforce test coverage through backlink checks.

What’s interesting: If your AGENTS.md is becoming a monolith, this is a structured, agent-navigable alternative that keeps design decisions and test specs linked to the code that implements them.

Google Releases Gemma 4: Apache 2.0 Licensed Models Built for Agentic Workflows

Estimated read time: 6 min

Following our coverage of MiniMax M2.7 matching frontier detection quality at 7% of the cost[1] last week, Google DeepMind launches Gemma 4 in four sizes under Apache 2.0 licensing. The 31B model ranks #3 among open models, outperforming models 20x its size. All variants include native function-calling, structured JSON output, and up to 256K context windows.

Why now: Permissively licensed open models with native agentic capabilities that run on consumer hardware have been the missing piece. Gemma 4 removes the last major licensing barrier.

[1] MiniMax M2.7 vs. Claude Opus 4.6: 90% of the Quality at 7% of the Cost

Ollama 0.19 Ships MLX Backend for Apple Silicon, Nearly Doubling Performance

Estimated read time: 4 min

Ollama 0.19 previews a backend built on Apple’s MLX framework, hitting 1,810 tokens/s prefill and 112 tokens/s decode on M5 chips. It adds NVFP4 quantization for production parity and smarter KV cache reuse that accelerates coding agents like Claude Code.

The context: If you’re running coding agents locally on a Mac with 32GB+ unified memory, this is now the fastest path to usable token throughput on Apple Silicon.

Cursor 3: A Unified Workspace for Multi-Agent Software Development

Estimated read time: 4 min

Cursor ships a ground-up redesign with a unified workspace built around parallel agents rather than file editing. The new interface supports multi-repo layouts, local-to-cloud agent handoff, an integrated browser, a plugin marketplace, and a streamlined diffs-to-PR workflow.

The signal: IDEs are evolving from file editors to agent orchestration platforms. Cursor’s redesign around parallel agents rather than file tabs shows where developer tooling is heading.

Google Ships MCP Server and Skills to Keep Coding Agents Current on Gemini APIs

Estimated read time: 1 min

Google releases two tools to fix stale coding agent output for Gemini APIs. The Docs MCP server and Developer Skills connect agents to live documentation and best-practice patterns, achieving a 96.3% eval pass rate with 63% fewer tokens.

The pattern: Stale training data is the top source of hallucinated API calls. Connecting coding agents to live documentation is an approach every API provider will likely adopt.

LLMOps in 2026: A Tool for Every Layer of the Production Stack

Estimated read time: 6 min

This roundup maps ten tools to distinct LLMOps production layers, from orchestration and routing to memory, guardrails, and packaging. Rather than listing popular names, it identifies one strong tool per production concern covering PydanticAI, Bifrost, Promptfoo, Letta, and KitOps among others.

The shift: The competitive question in LLMOps has moved from which model to use to how well you connect, evaluate, and harden everything around it.

NEWS & EDITORIALS

Google’s TurboQuant: 6x KV Cache Compression That Could Reshape AI Memory Economics

Estimated read time: 14 min

Google’s TurboQuant achieves 6x KV cache compression with no accuracy loss by converting vectors to polar coordinates and correcting residual error via Johnson-Lindenstrauss transforms. The data-oblivious approach needs no per-model calibration, with implications for vector databases and on-device inference.

The opportunity: If your inference costs or context-length limits are bottlenecked by KV cache memory, TurboQuant’s open-source PolarQuant and QJL components are available now and worth evaluating.

A Mirror Test for LLMs: Can Language Models Recognize Themselves?

Estimated read time: 8 min

A LessWrong post proposes adapting the classic animal cognition mirror test for LLMs, asking whether models can recognize their own outputs versus those of other models. The piece explores what self-recognition reveals about machine cognition and model identity.

The bigger picture: As LLMs grow more capable, the frameworks we use to probe their cognition matter as much as the benchmarks we use to measure their performance.

Claude Code’s Source Leaks: Anti-Distillation, Client DRM, and an Unreleased Agent Mode Called KAIROS

Estimated read time: 11 min

An accidental source map leak in Claude Code’s npm package exposed Anthropic’s full codebase, revealing anti-distillation defenses, client attestation DRM, and an unreleased autonomous agent mode called KAIROS with background daemons and cron jobs. Feature flags hint at Anthropic’s broader product roadmap.

Worth noting: The leaked feature flags and product roadmap, not the code itself, represent the real strategic damage. Accidental exposures carry consequences well beyond intellectual property.

Anthropic Cuts Off Claude Code Subscribers From OpenClaw Usage

Estimated read time: 3 min

Anthropic now requires Claude Code subscribers to pay extra for third-party harness usage starting with OpenClaw. The change lands days after OpenClaw’s creator joined OpenAI, raising questions about competitive motives versus sustainability concerns.

Heads up: If your workflow depends on third-party harnesses with Claude Code, budget for pay-as-you-go costs or start evaluating alternatives before the change takes effect.

Local Models, Specialists Not Substitutes

Sam Keen — Fri, 03 Apr 2026 12:11:11 GMT

For context, I’m running an M3 MacBook Pro with 36GB of unified memory. It’s a capable machine for local inference, but it has a ceiling around 30B parameters. Anything larger either won’t load or crawls to unusable speeds.

Local models haven’t wowed me for coding tasks. I’ve tried routing smaller models into my workflow, and the results consistently fall short of providing real value for me. That’s an honest assessment, not a dismissal. In The Steering Tax, I explored why prime models earn their keep for judgment-heavy work. Secondary and local models cost less per token but more in human steering.

But that framing only covers one dimension. Instead of “can local models replace API calls for coding?” I started asking “what tasks are local models genuinely better suited for?” That reframe led me somewhere useful.

The Experiment: PDF to Markdown, Locally

PDF parsing is a universal developer pain point. You’ve been there: a vendor sends a spec as a PDF, a client shares contracts, or you need to extract content from a research paper. The options aren’t great. Cloud OCR services work but introduce privacy concerns and per-page costs. Open source libraries like PyPDF often produce garbled output from complex layouts.

I’d flagged GLM-OCR in a recent Weekly Review as worth investigating. Testing it against real documents confirmed the promise. It’s a vision model purpose-built for text recognition. It runs locally via Ollama, processes one page at a time, and produces clean markdown. No API keys. No network requests. No per-page billing.

I wrapped the entire pipeline into a Claude Code Skill so it’s accessible with a single command: /pdf-ocr contract.pdf. The Skill checks dependencies, converts each PDF page to an image, runs it through GLM-OCR, and assembles the output into a single markdown file.

The Technical Decisions That Mattered

Three choices made the difference between usable and frustrating:

Image sizing is everything. GLM-OCR accuracy drops with oversized images. The pipeline renders pages at 72 DPI, then resizes to a maximum of 1024 pixels on the longest edge. That specific combination consistently produces the best recognition quality. Going higher actually hurts.

The prompt is weirdly specific. After testing variations, the prompt "Text Recognition:" outperforms everything else. More descriptive prompts like “Extract all text from this document” produce worse results. The model was trained with this exact prompt format. Respecting that training matters.

Zero pip dependencies. The Python script uses only stdlib: subprocess, json, base64, urllib. This means no virtual environment, no dependency conflicts, no setup friction. The heavy lifting comes from system tools (pdftoppm from poppler, sips on macOS) and Ollama.

Why Local Wins Here

Until recently, models small enough to run on consumer hardware couldn’t produce results worth using for tasks like this. That’s changed. Purpose-built models in the 1-4B parameter range now handle specialist tasks with real accuracy. GLM-OCR is a clear example: small enough for a laptop, accurate enough for real documents.

For individual developer use, a local model isn’t a compromise. It’s genuinely the better tool. Privacy is binary: legal documents, contracts, and internal specs either leave your machine or they don’t. There’s no “mostly private” cloud OCR. Running locally means the content never touches an external server. And it works offline. Airplane, VPN issues, flaky coffee shop WiFi. The pipeline doesn’t care.

The economics make sense too. A 50-page document through a cloud OCR API costs real money. Locally, it costs electricity. Latency is predictable with no network variability. On my M3-series Mac, each page processes in a few seconds. For enterprise use, the trade-off shifts: you’re maintaining infrastructure to run the model. But OCR is stateless, so the operational overhead is minimal. At scale, the cost savings over per-page API pricing could be significant.

Specialists, Not Substitutes

This experiment revealed a pattern I see repeating across other local model use cases. Local models aren’t underpowered versions of cloud models. They’re specialists.

GLM-OCR does one thing: recognize text in images. It’s small, fast, and tuned for exactly that task. It doesn’t need to write code, summarize articles, or hold conversations. That focus is its advantage.

Embedding models for local semantic search. Classification models for content routing. Whisper for audio transcription. Each is a focused tool that happens to run on your hardware. The value isn’t in replacing your cloud LLM. It’s in handling the tasks where a specialist outperforms a generalist, or where local execution is the better engineering choice. And this category is growing. Small, focused models improve with each generation while consumer hardware keeps closing the gap. The range of specialist tasks worth running locally will continue to expand.

Getting Started

The pipeline is available in Altered Craft’s ac-document-gen Claude Code plugin. Install it through the plugin marketplace:

# Add the marketplace
/plugin marketplace add AlteredCraft/claude-code-plugins

# Install the plugin
/plugin install ac-document-gen@alteredcraft-plugins

After restarting Claude Code, the /pdf-ocr skill is available. You’ll need Ollama running with the glm-ocr model pulled, and poppler installed (brew install poppler on macOS). The skill checks for these on first run and walks you through any missing dependencies.

Note: This pipeline was built and tested on macOS. The Python script uses only the standard library, but the system tools (poppler, sips) may need adaptation on other platforms.

The next time you catch yourself dismissing local models, try reframing the question. Not “is this as good as Claude?” but “is this the kind of task where a specialist model, running on my hardware, is the right tool?” You might be surprised how often the answer is yes.

When Your AI Coding Tool Becomes Your Teacher

Sam Keen — Tue, 31 Mar 2026 12:21:40 GMT

When Your AI Coding Tool Becomes Your Teacher

I can build a working Tauri app without understanding Tauri. Claude will write the Rust backend, scaffold the Svelte frontend, wire the IPC bridge, and hand me something that compiles and runs. For a one-off tool, that’s fine. For a framework I plan to work with long-term, it’s not enough. I need to understand how the pieces connect so I can evaluate what AI generates, debug when things break, and make architectural decisions the AI can’t make for me.

The question isn’t whether to use AI. I let Claude write most of my code for side projects, and I have no intention of changing that. The question is what to do when you’re picking up something new and you actually need the mental model, not just the output. The conventional wisdom is to set AI aside and learn the old-fashioned way. I tried a third option.

The cost of skipping the mental model is measurable. A randomized controlled trial by Anthropic found that developers who used AI assistance scored 17 percentage points lower on code comprehension tests compared to those who coded by hand. The largest deficit appeared in debugging, the exact skill you need to evaluate AI-generated code. As Unmesh Joshi wrote on Martin Fowler’s blog, LLMs can threaten the learning loop essential for building expertise. Design emerges through implementation struggle, not pre-planning. Remove the struggle, remove the learning.

Same Tool, Opposite Outcome

What caught my attention was a Wharton School study published in PNAS. The researchers didn’t just test unfettered AI access. They also tested AI with pedagogical guardrails, a version they called “GPT Tutor.” The unguarded version caused a 17% performance decline. The guarded version almost entirely eliminated the negative effects.

Same tool. Different constraints. Opposite outcome.

That finding aligned with something I’d been experimenting with: using Claude Code not as a code generator, but as a learning partner. Claude Code has a Learning output style that changes its behavior from “implement everything” to “teach by building together.” Instead of writing all the code, it leaves strategic gaps marked TODO(human) for you to implement, and wraps its work in Insight blocks that explain why things are done the way they are.

The Learning output style was the foundation. What I built on top of it was a harness that turned Claude Code into a structured learning environment.

The Learning Harness

Tauri is a framework for building desktop and mobile apps with web frontends, similar to Electron but with a Rust backend instead of Node.js. The project I chose as my learning vehicle is NatLang Todo: a desktop app built with Tauri 2 (Rust backend, Svelte 5 frontend) where the only input surface is natural language. No forms, no buttons. You type what you want in a chat interface, a local LLM classifies the intent, and Rust executes the operation. That design pushes roughly 70% of the complexity into the Rust backend, which is exactly where the learning goals live.

The harness has four layers.

The first layer is the Learning output style itself. When enabled, Claude Code shifts its approach at decision points. Instead of implementing a function, it writes the surrounding code (file structure, imports, signatures, context comments), then marks the strategic gap with TODO(human) for you to implement. It explains the trade-offs you should consider. It provides the context. You write the meaningful code. The harness adds structure on top: a directive in CLAUDE.md ties each TODO(human) to a specific learning goal ID from the curriculum, so both you and the agent know which concept is being exercised.

Here’s what that looks like in the editor. When I needed to define the core data model, Claude created the models.rs file with the TodoStatus enum already implemented (including derive macros for serialization), a test verifying enum equality, and the TODO(human) [F2] marker where I needed to define the Todo struct:

Here the agent has left a TODO for the learner

The agent didn’t just leave a blank. It listed the six fields I needed, posed the conceptual question (”what Rust type represents ‘might not exist’?”), and hinted that I should think about which traits the struct needs. I had to decide. I had to think about Option. That moment is where learning happens.

The second layer is CLAUDE.md as curriculum. The project’s CLAUDE.md file references a learning-goals.md file containing 7 foundational Rust goals and 11 applied Tauri/Svelte goals, organized into 10 progressive build phases. It directs the Learning mode to tie TODO(human) markers to specific goal IDs, calibrate for the learner’s current phase, and avoid introducing concepts from later phases prematurely.

The CLAUDE.md also encodes three validation techniques borrowed from established pedagogy. After you complete a TODO(human), the agent might ask: “Explain what you just did. Why did you use &mut self here instead of self?” If you can only say “the compiler told me to,” the concept hasn’t landed. Before running tests, it asks you to predict: “What do you think cargo test will show?” Wrong predictions reveal gaps between your mental model and reality. After a concept is demonstrated, it asks you to modify the code in a way that exercises the same concept differently. If you can adapt without hand-holding, you understand it.

Claude Code hooks are automated actions that fire on specific events. The third layer is a SessionStart hook that injects your current build phase and unchecked goals into every new session. When you open the project tomorrow, the agent already knows where you left off. You never have to re-explain your progress. It fires on startup and after context compaction, so continuity is maintained even in long sessions.

The fourth layer is a /checkpoint skill invoked by the user at the end of each session. It reads your learning goals and recent git history, maps file changes to the goals you exercised, and asks targeted reflection questions. It updates the Notes field for each goal with dated entries capturing what clicked, what’s still fuzzy, and what evidence items to check off. This creates a structured learning record separate from Claude’s built-in memory. Memory handles where to resume. Checkpoint handles what you understood.

The four layers chain into a loop:

The general learning loop. Do to the nature of the LLM teacher, you are able to take deep dive tangents at any point

Here’s what that continuity looks like in practice. Opening a new session, the agent picks up where we left off and moves into the next lesson:

What I Learned By Building It (And Using It)

The framework didn’t work correctly the first time. Several assumptions I made during design failed during testing, and the fixes are where the interesting insights live.

Here’s a checkpoint from an early session. The agent maps recent git changes to learning goals, checks off evidence items that were demonstrated, and asks validation questions before updating the record:

Working code is not proof of understanding. My original learning goals were aspirational: “Can explain why Rust has ownership.” The problem was that the agent couldn’t validate these. Tests passing doesn’t prove the learner internalized the concept. I rewrote every goal as observable evidence: “Fixed a borrow checker error by changing self to &self and explained why.” Now the agent checks goals off when it sees the learner do or say the specific thing. Most AI-assisted learning resources don’t make this distinction between tutorial completion and verified learning. It maps directly to what the ACM study on novice programmers calls the “illusion of competence”: students who used GenAI thought they performed better than they did. Observable goals are how you prevent that illusion.

Trust erosion is fast and expensive. During testing, the agent generated a Tauri documentation link that looked perfectly plausible. It 404’d. One bad link, and the learner’s confidence in every link the agent provides is damaged. In a learning context where you’re asking someone to trust the AI as a teacher, credibility is non-negotiable. I built a /docs skill that uses WebSearch and WebFetch to find and validate documentation links, and added a directive to CLAUDE.md: “Never fabricate a URL. Only use links from the reference list or validated via /docs.” The skill exists because the agent hallucinated once. That’s all it takes.

The Part That Surprised Me

I expected the harness to be useful. What I didn’t expect was how effective Claude’s learning persona would be in practice. The interactions didn’t feel like a tutorial or a chatbot. They felt like pairing with someone who had infinite patience and deep knowledge of Rust, but who also knew exactly when to stop explaining and let me struggle.

At any point during a session, I could go deeper. “Why does Rust have both String and &str?” would get a thoughtful comparison to Python/JavaScript equivalents, with a link to the relevant Rust Book chapter. Quick clarifications landed in two sentences. Conceptual deep-dives came with ASCII diagrams showing ownership flow. The agent calibrated based on what it had seen me understand in previous sessions, so it wasn’t re-explaining concepts I’d already demonstrated.

The repo itself is a bootstrapped Tauri app plus the learning framework I built around it. Beyond the Tauri bootstrap, the entire system is natural language: markdown directives, JSON config, and skill definitions. No custom code powers the learning experience. That matters because the harness is fully inspectable. Every directive is readable English. Adapting it for a different stack means rewriting markdown, not code.

Fork It, Adapt It, Make It Yours

A CodeSignal survey found that 76% of developers already use AI tools to learn new skills. It’s the number one use case, ahead of code generation. Developers want to learn with AI. The tools just aren’t designed for it yet.

The NatLang Todo repository is on GitHub. You can use it two ways.

Learn Tauri as-is. Fork the repo and open it in Claude Code. The Learning output style is already configured in the repo’s settings. Build a natural-language todo app through 10 progressive phases. The curriculum handles the rest: goal tracking, session continuity, reflection checkpoints, validated documentation links.

Adapt the pattern for your stack. The learning framework is applied here to Tauri, but it could be extracted and pointed at anything. Want to learn Go by building a CLI tool? Replace the learning goals, adjust the phase docs, point the documentation references at the Go standard library. The skills, hooks, and structural pattern carry over. A technical educator could fork this repository and adapt it without writing a line of Rust. Every directive is readable English. The CLAUDE.md-as-curriculum approach, the SessionStart hook for continuity, the /checkpoint skill for structured reflection. These are building blocks, not blueprints. Take the pieces that make sense for your context and leave the rest.

Do we still need to learn frameworks deeply in a world where AI can generate working code? I think so. Not because you need to memorize syntax, but because you need to understand how the framework is wired together. You need the mental model that lets you evaluate whether AI-generated code is correct, not just whether it compiles. The Anthropic study found that debugging, the skill most eroded by AI assistance, is exactly the skill you need to validate what AI produces. That’s a loop you don’t want to break.

The tool isn’t the problem. It’s how you choose to hold it.