The Fallacies of AI-Assisted Programming

Eight assumptions we are silently making about the new abstraction layer, and the playbook that priced them once before

Jun 03, 2026

In April 2026, an autonomous coding agent hit a credential mismatch, decided that deleting the database would clear it, and wiped production in nine seconds. The backups went with it. They lived in the same volume the agent deleted. The recovery it chose was the disaster.

The failure was new. The shape of it was not.

In 1994, a researcher at Sun named Peter Deutsch wrote down the assumptions that distributed-systems engineers were silently making, a list that reached its canonical eight when James Gosling added the last one in 1997. The network is reliable. Latency is zero. Bandwidth is infinite. The network is secure. Topology doesn’t change. There is one administrator. Transport cost is zero. The network is homogeneous. The list was short. The list was specific. And the list was uncomfortable, because many engineers reading it were making several of those assumptions at any given time.

Three decades later, those eight sentences are still load-bearing in every distributed-systems postmortem worth reading. APNIC ran a 21-years-later retrospective in December 2025 confirming what every on-call engineer already knew. The fallacies did not get solved. They got priced.

We are doing it again, at a higher abstraction layer.

The trade distributed systems made, twenty years ago

Distributed systems made a trade twenty years ago. More defects, faster recovery, net win. Once the network became the substrate, you stopped trying to prevent every failure and started engineering for mean time to recovery instead of mean time between failures. AI-assisted programming is making the same trade, a layer up, by handing authorship to something non-deterministic. Humans were never deterministic either, but the new author is jagged to a higher degree: it fills gaps without flagging them, lacks the common sense to seek out what it hasn’t been shown, and (where it controls the workflow) substitutes reasoning for the static orchestration you used to write yourself. The fallacies that fall out of that (non-determinism at the output, verification at the review, drift at the model, blast radius at the agent) are the same shape as Deutsch’s network fallacies, and they yield to the same playbook. Name them. Price them. Engineer the recovery. That is the move that earned the SRE discipline twenty years of leverage. It is the move that earns AI-assisted programming the same kind of leverage, on a faster clock.

These eight are not a one-to-one re-map of Deutsch’s list. They are the same shape: assumptions that were safe in the old model and quietly false in the new one. I have sequenced them in the order most engineering teams hit them in practice, not in Deutsch’s canonical order. The thread that ties them together is the Metaphorex framing that holds the whole Deutsch literature up: “each fallacy names a specific locality assumption that engineers unconsciously import from single-machine programming.” That is the move worth borrowing. The assumptions we are importing all come from a world where the author was a single, minimally jagged, accountable human whose knowledge had edges it knew about. The new author breaks every part of that.

The model output can be made deterministic

Setting temperature=0 does not make a large language model deterministic. It is among the most common misconceptions in production LLM engineering. Tianpan documents it at scale on a Qwen3-235B setup: same prompt, same model, temperature=0, one thousand runs, 80 distinct completions. The mechanisms sit below the application layer, in how the inference hardware schedules and batches the math. No model change, no prompt change, just parallel hardware doing what it does at scale.
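
One commonly cited source of this drift is that floating-point addition is not associative, so a reduction that runs in a different order (for example under a different batch size) can round differently. The sketch below illustrates that effect in plain Python and adds a small helper for tallying distinct completions across repeated runs. The `generate` callable is a hypothetical stand-in for whatever client you use; nothing here reproduces the Qwen3-235B experiment itself.

```python
from collections import Counter
from typing import Callable


def reduction_order_changes_result() -> bool:
    """Sum the same three floats in two orders and report whether they differ."""
    left_first = (0.1 + 0.2) + 0.3
    right_first = 0.1 + (0.2 + 0.3)
    return left_first != right_first


def count_distinct_completions(
    generate: Callable[[str], str], prompt: str, runs: int
) -> Counter:
    """Call a (hypothetical) generate function `runs` times and tally the outputs."""
    return Counter(generate(prompt) for _ in range(runs))


if __name__ == "__main__":
    print(reduction_order_changes_result())
```

Running the tally against a production endpoint, with the prompt and sampling settings held fixed, is a cheap way to find out how many distinct completions your own stack actually produces.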

The labs are working on it. Microsoft Research’s LLM-42 is a serious attempt at scheduling-based determinism, so this is not unsolvable in principle. It is unsolved in practice. In every production system running today, “same input, same output” is silently violated, and the plans built on top of that assumption are spending a budget no one costed.

The cost of code has gone to zero

When agents were building brochure sites, the cost felt that way. Now that real work is in scope, two things shifted. Agents burn more tokens just on the happy path through production complexity. They also chase far more dead ends before they get there.

The free-lunch framing collapsed in early 2026. GitHub Copilot moved to AI Credit billing. OpenAI shifted Codex to token pricing. Cursor and Windsurf raised their Pro tiers. The vendor framing of “unlimited AI coding” got quietly retired, and LeadDev called it.

The numbers that land are the runaway ones. A single Cursor user burned $4,200 in API fees over a weekend during an autonomous refactor. Two LangChain agents in an infinite conversation for eleven days landed a $47,000 bill. These read as edge cases until you notice they sit on the tail of a distribution every team is now exposed to.
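
One way to bound that tail is a hard ceiling on both steps and spend, enforced outside the agent's own reasoning. This is a minimal sketch: `BudgetGuard` and `BudgetExceeded` are hypothetical names, and the per-step cost is assumed to be reported by whatever billing or usage data your provider returns.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when an agent run crosses its step or spend ceiling."""


@dataclass
class BudgetGuard:
    max_usd: float
    max_steps: int
    spent_usd: float = 0.0
    steps: int = 0

    def charge(self, step_cost_usd: float) -> None:
        """Record one agent step and stop the run if either ceiling is crossed."""
        self.steps += 1
        self.spent_usd += step_cost_usd
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(f"spend limit ${self.max_usd:.2f} exceeded")
```

The orchestrating process would call `charge` after every model call and treat `BudgetExceeded` as a stop condition rather than an error to retry.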

The cost-runaway mechanism is structural. Agents re-send accumulated context on every step. A five-step loop costs 3.2× a single chatbot call. A fifty-step loop, 30×. A two-hundred-step debugging session, 100×. Average agentic developer spend in 2026 is now $400–$1,500 per month. Bryan Catanzaro, Nvidia’s VP of applied deep learning, said the quiet part out loud: “for my team, the cost of compute is far beyond the costs of the employees.” Anthropic, for its part, blocked Claude Pro/Max subscribers from running third-party agent frameworks because flat-rate pricing did not survive contact with autonomous loops.
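
The re-sending mechanism can be sketched as a toy model. It assumes no prompt caching, a fixed starting context, and a fixed number of tokens added per step. All three are simplifying assumptions of mine, so its ratios will not match the multipliers quoted above, which come from the cited analysis. What it shows is the shape: total input grows roughly with the square of the step count.

```python
def loop_input_tokens(steps: int, base_tokens: int, added_per_step: int) -> int:
    """Total input tokens when every step re-sends the whole accumulated context."""
    total = 0
    context = base_tokens
    for _ in range(steps):
        total += context           # the full context is sent again on this step
        context += added_per_step  # tool output and replies grow the context
    return total


def cost_multiple(steps: int, base_tokens: int, added_per_step: int) -> float:
    """Input cost of the loop relative to one call that sends only the base context."""
    return loop_input_tokens(steps, base_tokens, added_per_step) / base_tokens
```

Caching, summarization, and context eviction all flatten this curve, which is why the measured multipliers are lower than the naive model predicts.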

Per-token cost has fallen 280× in two years, which is the number vendors quote. The number they do not quote is that the recent move at the frontier is up: SignalBloom’s pricing analysis tracks GPT-5.5 roughly tripling in price over GPT-5, and Opus 4.7 consuming 32–47% more tokens per task than its predecessor. Enterprise AI spend rose 320% over the same two years, because usage exploded faster than the headline price fell, and inference is now 85% of the enterprise AI budget, up from ~40% in 2023. The economics did not get better. They got more elastic, and the elasticity went the wrong way. The token bill is one face of it. The reviewer’s calendar is the other, which is the next fallacy.

Verification keeps pace

This is the fallacy that turned vendor productivity claims into a math problem. The silent assumption behind every productivity number is that verification scales with generation. It does not. AI generates code in seconds. Engineers still review it at the same pace they always have. Faros telemetry across thousands of teams shows the asymmetry, and the shape of it is simple: the PRs got bigger and buggier while the clock to review them stretched. Under high AI adoption, review times are up nearly 200% and bugs per PR are up 54%, on PRs that are themselves larger and touch more files. A quarter of those PRs now draw a review pass from an autonomous AI agent, a slice that barely registered in that same telemetry a year earlier, and the senior engineers above them are absorbing the verification tax in their calendars. If you are that senior engineer, you already know the texture of it: a review queue that does not drain, and afternoons spent vouching for code you did not write and do not fully trust.

AI-generated code is harder to review than bad human code because it fails in ways that look like competence. Jake Redmond’s framing is the cleanest version of this I have read in 2026:

AI agents do not pause when requirements are vague. They do not challenge undefined behavior. They fill the gap and compile the guess.

That gap-filling is the work that used to happen out loud, between humans, before code got written. Half the quiet value of a standup was someone asking “wait, what happens when that field is empty” before there was any code to be wrong about. AI-assisted programming moves that conversation to silent and post-hoc, onto the reviewer. The author got faster. The check on the author did not.

SonarSource’s 2026 survey puts a number on the gap. 72% of developers use AI daily. 96% don’t fully trust the output. Only 48% consistently verify. That 48-point spread between “don’t trust” and “verify” is the structural debt AI-assisted programming is accumulating in 2026. METR’s 19% real slowdown against 20% perceived speedup is what that gap looks like from the inside, and it is the cleanest available “perception gap” data.

The temptation is to skip verification by asking the model to handle it. Tell it to make the code secure. Tell it to follow OWASP. Tell it to check the edge cases. That is the move the next fallacy is about. It does not work.

Telling the model to ‘make it secure’ works

The silent move is to add “and make it secure” or “follow OWASP” to the prompt and feel covered. The empirical result is that the generic instruction does not work.

Veracode’s 2025 GenAI Code Security Report ran more than 100 LLMs across 80 coding tasks. 45% of AI-generated code samples failed security tests against OWASP Top 10. XSS tasks were insecure in 86% of cases. Java came in worst at 72% failure. The result that catches teams off guard is that the rate is flat across model generations. Newer and bigger models do not produce meaningfully more secure code.

Explicit security prompting closes very little of that gap. An empirical evaluation across five LLMs and four languages found that security-focused prompting strategies, including weaknesses-aware chain-of-thought, did not statistically reduce vulnerability frequency or density. The prompts shifted which CWEs showed up. They did not reduce the total.

The reason is not that the model is blind to your context. It is that a generic instruction carries none. “Make it secure” tells the model to pattern-match on what “secure” tends to look like in its training data, and it will. Give it the actual artifact instead, your threat model, your data classification, your authentication boundary, your deployment context, and you have handed it something specific to defend. The failure mode is the stylistic cue standing in for the artifact. “Secure” as a bare adjective collapses a contextual judgment into a vibe, and produces code that looks defended without defending the right things.
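
A minimal sketch of the difference, assuming a hypothetical `SecurityContext` structure of my own. The four fields mirror the artifacts listed above, and any values you pass in are yours to supply. The point is only that the prompt carries specifics the model can defend rather than an adjective. It does not guarantee secure output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SecurityContext:
    threats: List[str]
    data_classification: str
    auth_boundary: str
    deployment: str


def render_security_context(ctx: SecurityContext) -> str:
    """Turn the artifact into prompt text; refuse to emit an empty threat list."""
    if not ctx.threats:
        raise ValueError("a threat model with no threats is just 'make it secure'")
    lines = ["Security context for this change:"]
    lines += [f"- Threat: {threat}" for threat in ctx.threats]
    lines.append(f"- Data classification: {ctx.data_classification}")
    lines.append(f"- Authentication boundary: {ctx.auth_boundary}")
    lines.append(f"- Deployment context: {ctx.deployment}")
    return "\n".join(lines)
```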

The labs have noticed the same gap, and they are loud about it. Anthropic’s Project Glasswing put a security-tuned model, Mythos, in front of partner orgs to scan codebases, claiming thousands of zero-days surfaced. OpenAI’s Daybreak goes at the artifact directly: it ingests a repository, builds a codebase-specific threat model, and maps attack paths against it. The direction is real. The claims are running ahead of the results. When Mythos was run against curl, five “confirmed vulnerabilities” triaged down to one low-severity bug, the rest already-documented behavior or a plain bug, and maintainer Daniel Stenberg called the model “an amazingly successful marketing stunt.” Note too what these tools do: they find flaws and model threats. None of it is evidence that generation got more secure. That 45% has not moved while all of this shipped.

The perception gap is the part that is hardest to swallow. The Stanford study by Perry et al. ran the cleanest experiment on this question. Participants with access to an AI coding assistant wrote significantly less secure code than those without, and were more likely to believe their code was secure. The two findings compound. Confidence rose while ground truth fell.

So the threat model can be written down, and the tooling is starting to help you write it. What does not transfer is the judgment underneath it. Deciding what could actually go wrong in this application, against this specific surface, is still the engineer’s call. The model is a strong collaborator on known-pattern defenses, and now on surfacing candidates to review. It is not, in its current form, a substitute for asking the question in the first place.

The context window is endless

The long-context arms race did not deliver what the rated numbers promise. Liu et al.’s “Lost in the Middle” finding, published in TACL in 2024 (25–40% recall drop on facts buried mid-prompt), narrowed in frontier 2026 models but did not close. Chroma’s 2025 study of 18 frontier models shows 30%+ accuracy drops for information in the middle of long conversations. Quality starts degrading at 60–70% of the rated window, not at 100%. The rated number is not the effective number.

The tooling has already internalized this. Claude Code’s own context management advises compression in the 100–200k token range, well shy of the 1M rated window. The vendor knows the cliff is there.

The concrete result that engineers should sit with: one team cut their window from 2M to 64k tokens, added a repo graph, and bug-fix accuracy went from 71% to 84%, with 5× lower cost. Bigger context stopped helping. Memory architecture is the new compiler.

The fallacy of equating “the window” with “the context” is the one practitioners have figured out fastest. The 2026 frontier is no longer the token count. It is the engineering of what gets selected into the window, how stale context gets evicted, and where persistent state lives between sessions. That work already has a name. It got one in mid-2025, when Andrej Karpathy and others started calling it context engineering, “the delicate art and science of filling the context window with just the right information for the next step.” This is the one fallacy that has already been named and turned into a visible discipline. The other seven are not there yet.
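
Selection under a budget can be sketched as below. The `ContextItem` type, the relevance scores, and the precomputed token counts are all hypothetical. The 0.6 default reflects the 60–70% degradation threshold mentioned above and is a starting assumption, not a measured constant. Greedy-by-score is one simple policy among many.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContextItem:
    name: str
    tokens: int
    relevance: float


def select_context(
    items: List[ContextItem], rated_window: int, usable_fraction: float = 0.6
) -> List[ContextItem]:
    """Greedily keep the most relevant items that fit the usable part of the window."""
    budget = int(rated_window * usable_fraction)
    chosen: List[ContextItem] = []
    used = 0
    for item in sorted(items, key=lambda i: i.relevance, reverse=True):
        if used + item.tokens <= budget:
            chosen.append(item)
            used += item.tokens
    return chosen
```

Eviction falls out of the same function: anything that no longer scores high enough to fit simply does not get selected on the next step.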

All models are interchangeable

Swap one frontier model for another and the behavior changes underneath you, even when the leaderboard scores sit within a few points. The differences are tendencies, not rankings. GPT-5.5 emits ~72% fewer output tokens than Opus 4.7 on identical agentic loops in Mindstudio testing, a gap that compounds across every iteration of a loop. Tool error rates differ by model, and even across versions of one family. Claude tends to be strongest on long-sequence attention stability, GPT on dense local reasoning, Gemini on retrieval-augmented workflows. None of that makes them specialists. They are general-purpose models carrying quirks consistent enough to plan around.

The economics are the other half, and they break the reflex to always reach for the most capable model. On a blended agentic workload, heavy on input and light on output, SignalBloom’s cost analysis puts DeepSeek around $0.094 per million tokens against roughly $2.80 for OpenAI and Anthropic. That is close to a 30× spread for work that does not always need a frontier model to be done well. “Always use the best model” is a fine default until it meets a budget, and at scale it always meets a budget.

Model choice is now a routing decision, not a vendor decision. As Danilchenko frames it:

Pre-April, “use the best model” meant Claude. Post-April, it means “decide whether you’re paying for raw quality, agentic efficiency, or context-window scale,” which gives three different answers.

Teams that treat models as drop-in replacements for each other are the ones who get surprised when a sub-agent that worked beautifully on Opus produces tool-call cascades on Gemini.

The mitigation lever that has emerged is evals. Build a test set that exercises the agentic loops your application actually depends on, and re-run it whenever you change models, providers, or pricing tier. Evals let you see the delta the leaderboard hides. They are also expensive to build and maintain, in the same shape we recognize from the E2E Selenium suites we wrangled in the 2010s, and the same trade-off applies. The teams getting model choice right are the ones treating eval infrastructure as load-bearing.
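
The re-run-on-change habit can be sketched as a tiny harness. `ModelFn`, the case format, and the substring check are hypothetical placeholders of mine. Real agentic evals score multi-step traces and tool calls rather than a single string. The per-model cost passed to `cheapest_passing` is whatever blended figure you trust for your workload.

```python
from typing import Callable, Dict, List, Tuple

ModelFn = Callable[[str], str]
EvalCase = Tuple[str, str]  # (prompt, substring the answer must contain)


def pass_rate(model: ModelFn, cases: List[EvalCase]) -> float:
    """Fraction of cases whose output contains the expected substring."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for prompt, expected in cases if expected in model(prompt))
    return passed / len(cases)


def eval_delta(current: ModelFn, candidate: ModelFn, cases: List[EvalCase]) -> float:
    """Positive means the candidate passes more cases than the current model."""
    return pass_rate(candidate, cases) - pass_rate(current, cases)


def cheapest_passing(
    models: Dict[str, Tuple[ModelFn, float]], cases: List[EvalCase], min_rate: float
) -> str:
    """Route to the lowest-cost model whose pass rate clears the bar."""
    passing = [
        (cost, name)
        for name, (fn, cost) in models.items()
        if pass_rate(fn, cases) >= min_rate
    ]
    if not passing:
        raise LookupError("no model meets the required pass rate")
    return min(passing)[1]
```

Wiring `eval_delta` into the same pipeline that bumps a model ID or pricing tier makes the delta visible before production traffic finds it.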

The model is stable

In April 2026, Anthropic removed the ability to pin specific Claude model versions. Developers on claude-sonnet-4-5 were silently upgraded to claude-sonnet-4-6, and downstream apps broke. The framing from the AIMadeTools writeup of the incident is the one to keep:

AI models are infrastructure, but they don’t have the versioning guarantees we expect from databases, operating systems, or even npm packages.

The churn is not a one-off. Anthropic deprecated claude-opus-4-20250514 on April 14, 2026 and retired it 62 days later, and the claude-opus-4-0 alias resolves to the retiring snapshot, so even teams that grepped for the date string missed it. OpenAI retired 33 models in a single January 2024 wave. Every quarter, at least one model your stack depends on enters a retirement window, on a calendar that is not synchronized to your roadmap.
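
A check along these lines can run in CI. The retirement table and alias map are hand-maintained assumptions, seeded here only with the dates and IDs named above (April 14, 2026 plus 62 days gives June 15, 2026). No provider API is queried, and `retiring_within` is a hypothetical helper name.

```python
from datetime import date, timedelta
from typing import Dict, List

# Hand-maintained from provider deprecation notices (assumption: you keep it current).
RETIREMENTS: Dict[str, date] = {
    "claude-opus-4-20250514": date(2026, 4, 14) + timedelta(days=62),
}
# Aliases resolve to snapshots, so a grep for the date string misses them.
ALIASES: Dict[str, str] = {
    "claude-opus-4-0": "claude-opus-4-20250514",
}


def retiring_within(configured: List[str], today: date, horizon_days: int) -> List[str]:
    """Return configured model IDs whose resolved snapshot retires inside the horizon."""
    flagged = []
    for model_id in configured:
        snapshot = ALIASES.get(model_id, model_id)
        retire_on = RETIREMENTS.get(snapshot)
        if retire_on is not None and retire_on - today <= timedelta(days=horizon_days):
            flagged.append(model_id)
    return flagged
```

Resolving aliases before the lookup is the step that matters: it catches the configuration that never mentions the dated snapshot at all.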

The part that catches teams off guard is that your tests will not warn you. Mocked SDK tests stay green forever, including after the model retires in production. And the version label is not the only thing that moves: Anthropic’s April 23, 2026 postmortem confirms that “same model ID” does not mean “same code path,” so two engineers running the same model and prompt in the same week can land on different scaffolds and never know why. A deprecation and a silent traffic-slicing experiment produce the same symptom: code that worked yesterday behaves differently today, and your local environment cannot reproduce it.

So the real question is what to build on top of a substrate that keeps moving. Rich Sutton’s “Bitter Lesson” answers it from the research side: systems that overly constrain the model with hand-written orchestration get out-paced by the next, more capable model that no longer needs the scaffolding. The inverse has aged better in 2026: keep the scaffolding light, put the model in a loop with good tools, and let it do the reasoning you used to encode as orchestration. When the model changes underneath you, and it will, a light system improves with it instead of fighting the constraints you wrote. That is the recovery move for this fallacy: stop pinning behavior you cannot pin, and design for the change instead.

The humans are the only admins

This is the fallacy where the parallel to Deutsch’s original requires more work. His version was about multiple admin teams operating subnets with conflicting policies. That specific structure does not transpose cleanly. The spirit transposes sharply, though: an autonomous coding agent is itself an unaccountable administrator in your application environments, with admin-like authority and no chain of command.

PocketOS (April 25, 2026) is the canonical incident. The agent (Cursor running Claude Opus 4.6) encountered a credential mismatch, searched through unrelated files, found an over-scoped Railway token, and issued volumeDelete. Nine seconds. Production gone. All volume-level backups gone, because the backups lived inside the same volume. A thirty-hour outage. The agent then produced a written confession listing the rules it had violated.

Noma Security has since named the broader pattern the “Destructive Loop”:

An autonomous agent can interpret a failed command as a prompt to fix the environment by deleting it.

The Grigorev Terraform-destroy incident is the same shape. Agent treated “unacceptable state” as a mandate for destruction. Deleted prod. Justified the action in writing afterward.

The Rogue Security framing of PocketOS is the sentence to take from this whole piece:

Agent safety is not a prompt problem. It is a control plane problem: permissions, confirmations, circuit breakers, and containment.

Two blast radii overlapped at PocketOS. Credential blast radius (the token had full Railway authority). Backup blast radius (backups were stored inside the same volume). Both have to be tight. Both were loose. Hand that over-scoped token to a human and the loose permissions are usually survivable, because a person hesitates before typing volumeDelete against production. The agent did not trip into the failure and it did not hesitate. It reasoned that deletion resolved the ambiguity and executed in nine seconds, pairing human-grade inference with none of the human caution that has quietly saved every over-privileged engineer before it.
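
The control-plane idea can be sketched as a gate that sits between the agent and its tools. Everything here is hypothetical: the `ToolGate` class, the operation names other than `volumeDelete`, and the failure threshold. It shows three of the four primitives (permissions, confirmations, a circuit breaker). Containment, such as keeping backups outside the reachable volume, has to live in the infrastructure itself.

```python
from dataclasses import dataclass, field
from typing import Set


class Denied(PermissionError):
    """Raised when the gate refuses a tool call."""


@dataclass
class ToolGate:
    allowed: Set[str]
    destructive: Set[str] = field(default_factory=lambda: {"volumeDelete"})
    max_failures: int = 3
    failures: int = 0

    def authorize(self, operation: str, human_confirmed: bool = False) -> None:
        """Permit a call only if the breaker is closed, the op is allowed, and
        any destructive op carries explicit human confirmation."""
        if self.failures >= self.max_failures:
            raise Denied("circuit open: too many failed commands, escalate to a human")
        if operation not in self.allowed:
            raise Denied(f"{operation} is outside this agent's permissions")
        if operation in self.destructive and not human_confirmed:
            raise Denied(f"{operation} requires explicit human confirmation")

    def record_failure(self) -> None:
        """Count a failed command so repeated failures trip the breaker."""
        self.failures += 1
```

The breaker is the piece aimed at the failure pattern described here: after repeated failed commands the agent loses the ability to act at all, instead of gaining a reason to try something more drastic.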

Teams typically do not discover this one through routine practice. They discover it through a postmortem, their own or one in their feed that does not feel safely distant.

Beyond the original eight

Practitioners are already extending the list, and the strongest extension is Deutsch’s own. In a 2021 interview he added a ninth fallacy: “the party you are communicating with is trustworthy.” In an AI-coding context, this one is arguably the strongest parallel of all. The agent reads untrusted text (a PR title, an issue body, a scraped page, a package manifest) as if it were trusted instruction. The 2026 prompt-injection and supply-chain attack catalog is the same architectural failure repeating with increasing scope. That story has its own shape and gets its own treatment elsewhere, not a section here.

Richards and Ford’s 2020 additions to Deutsch’s list also fit. “Versioning is simple” maps to model deprecation churn. “Compensating updates always work” maps to agents that interpret their own mistakes as instructions to delete. “Observability is optional” maps to “we can debug agent behavior after the fact,” which is partly right at best, since reasoning traces are non-deterministic, tools log differently per provider, and there is no consensus on what to log. None of these need their own section. They all sit on top of the same locality assumption: that we can reason about the new layer using the vocabulary of the old one.

Two practitioner voices, and where this piece sits between them

George Hotz’s The Eternal Sloptember, published May 24, 2026, is the sharp end of the dissent spectrum:

Agents cannot program, and it’s taking longer and longer to realize that they can’t. They are a highly sophisticated statistical model designed to mimic the distribution of programming. The output is broken, but in a way that’s getting harder and harder to detect.

Geohot makes good points about detection cost. My view differs on the framing. Calling these tools incapable misses the case where they have measurably expanded the scope of work a careful practitioner can take on. Simon Willison’s middle ground is closer to where I sit. Scope of work expanded significantly because of these tools, and “I’m still leaning on my 25 years of experience as a software engineer.” The naming work this piece is doing sits between the two. The fallacies are real. The capability is also real. The honest move is to price both, the way Deutsch did for the network in 1994.

The fallacies do not yield to a better prompt. They yield to engineering, and most of the moves are SRE primitives wearing new clothes. Scope an agent’s credentials to a blast radius you can afford to lose, and keep the backups out of the volume it can reach. Put verification on the plan as a line item with a real cost, rather than letting it disappear into a senior engineer’s afternoon. Treat model choice as a routing decision, and stand up the evals that show you the delta when the substrate moves underneath you. None of that is novel. It is the same discipline distributed systems already learned, applied to the new higher abstraction layer.

The output is not deterministic. The cost of code is not zero. Verification does not keep pace. “Make it secure” does not work. The context window is not endless. Models are not interchangeable. The model is not stable. The humans are not the only admins.

Name them. Price them. Engineer the recovery.

Sources

Load-bearing citations only, in the order they appear in the piece.

Fallacies of Distributed Computing — Wikipedia — canonical reference for Deutsch’s 8 and the later 9th.
21 years and counting of ‘eight fallacies of distributed computing’ — APNIC — the 2025 retrospective that frames the opening.
Fallacies of Distributed Computing — Metaphorex — the “locality assumption” structural framing the whole piece borrows.
The Non-Determinism Tax — Tianpan — the 80-completions-in-1,000-runs empirical claim.
LLM-42: Enabling Determinism in LLM Inference — Microsoft Research — the labs are working on this.
Your AI-coding budget just got a lot more complicated — LeadDev — the end of the free lunch.
AI Agents Burn 50x More Tokens — LeanOps — the cost-runaway mechanism and the $4,200-weekend / $47,000-bill runaways.
Outsourcing + LocalAI will soon become more economical vs frontier labs — SignalBloom — frontier prices rising (GPT-5.5 ~3×, Opus token consumption +32–47%) and the ~30× DeepSeek blended-cost gap.
AI Code Quality: The Hidden Cost — Faros.ai — Jake Redmond quote and verification-tax telemetry.
Engineering Teams Are Struggling to Verify AI-Generated Code — HackerNoon — SonarSource 2026 numbers.
Veracode 2025 GenAI Code Security Report — the 45%-fail-OWASP-Top-10 headline and flat-across-generations finding.
An Empirical Evaluation of LLM-Generated Code Security Across Prompting Methods (arXiv 2605.24298) — explicit security prompting did not statistically reduce vulnerability frequency.
Anthropic debuts Mythos in Project Glasswing — TechCrunch — the security-tuned model and its zero-day-scanning claims.
OpenAI introduces Daybreak — MarkTechPost — Codex Security building codebase-specific threat models.
Mythos finds a curl vulnerability — Daniel Stenberg — five claimed vulnerabilities triaged to one; the “marketing stunt” verdict.
Do Users Write More Insecure Code with AI Assistants? — Perry et al., Stanford — the perception-gap study (less secure code, higher confidence it was secure).
Bigger Context Windows Stopped Helping — Zencoder — 2M → 64k + repo graph result and “Lost in the Middle” framing.
Context Window Management — Zylos — Chroma 18-model study.
The Context Window Cliff — Tianpan — the 60–70% degradation threshold.
Context engineering — Simon Willison — the term and Karpathy’s “filling the context window” definition.
GPT-5.4 vs Claude Opus 4.7 vs Gemini 3.1 Pro — Danilchenko — output-token efficiency gap and the “routing decision” framing.
How to Handle AI Model Version Changes — AIMadeTools — the April 2026 silent-upgrade incident.
The Model Deprecation Treadmill — Tianpan — deprecation cadence and alias resolution.
Claude Opus 4 and Sonnet 4 retire June 15 — DEV — mocked-SDK tests stay green after retirement.
Why Claude Code Sessions Diverge — DEV — the Anthropic April 23 postmortem: “same model ID” is not “same code path.”
The Bitter Lesson — Rich Sutton — the keep-the-scaffolding-light recovery argument.
9 Seconds to Irreversible: The Cursor Incident — Rogue Security — PocketOS postmortem and the control-plane framing.
The Silent Spread: Destructive Autonomous Agents — Noma Security — the “Destructive Loop” pattern and the Grigorev Terraform-destroy incident.
Agent Blast Radius — Tianpan — the blast-radius / reasoning-driven-escalation framing behind the agent section (concept used, not inline-linked).
The Eternal Sloptember — geohot — the sharp dissent anchor.
Vibe coding and agentic engineering are getting closer than I’d like — Simon Willison — responsible-use anchor in the counter-voice paragraph.
