Tilth gets a Flight Recorder

Seeing what a long-running agent actually does

Jun 19, 2026

A quick progress update on Tilth, the small agent harness I introduced a while back for running open-weights models against real coding tasks. Most of the recent work has gone into one thing: being able to see what a session is doing, while it runs and after it finishes.

It takes the form of a web app with two parts: a dashboard and a visualized stream of a session's activity log.
Here is the dashboard for a finished session:

Tilth's dashboard for a finished session: status all_done, 285k tokens across 4 tasks. It shows limit utilization well under the token and wall-clock caps, per-task iteration counts (7, 12, 9, 13 against a cap of 32), four accept verdicts and zero rejects, a session timeline of task spans with verdict markers, and a context-pressure bar chart where prompt tokens climb within each task and reset at task boundaries.

Four tasks, start to finish, in about six and a half minutes and 285k tokens. The reason I keep coming back to this view is that it answers a question the exit code can’t. A run that finishes and a run that finishes well look identical if all you have is all_done. The difference is in the shape.

A few patterns I read off it now without thinking:

Iteration counts spread evenly and well under the cap: a healthy run. One task pinned near the ceiling is one that got stuck.
Mostly accepts, few rejects: the worker and the evaluator agree. A wall of rejects means they are talking past each other.
Context pressure that climbs through a task and resets at each boundary: the out-of-context memory doing its job. Bars that never drop back at a boundary are context quietly bloating.
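Those three readings are simple enough to write down as checks. What follows is a minimal sketch, not Tilth's code: the `TaskShape` fields, the 90% threshold, and the token numbers are assumptions made up for illustration. Only the iteration counts and the cap of 32 come from the dashboard above.

```python
from dataclasses import dataclass


@dataclass
class TaskShape:
    """Hypothetical per-task summary; field names are illustrative, not Tilth's schema."""
    task_id: str
    iterations: int
    accepts: int
    rejects: int
    prompt_tokens: list[int]  # prompt size at each iteration, in order


def read_shape(tasks: list[TaskShape], iteration_cap: int = 32) -> list[str]:
    """Return warnings for a session; an empty list means the shape looks healthy."""
    warnings = []
    prev_peak = None
    for t in tasks:
        # A task pinned near the ceiling probably got stuck (90% is an arbitrary cutoff).
        if t.iterations >= 0.9 * iteration_cap:
            warnings.append(f"{t.task_id}: near iteration cap ({t.iterations}/{iteration_cap})")
        # More rejects than accepts: worker and evaluator may be talking past each other.
        if t.rejects > t.accepts:
            warnings.append(f"{t.task_id}: {t.rejects} rejects vs {t.accepts} accepts")
        # Context should reset at a boundary: a task's first prompt should be
        # smaller than the previous task's peak.
        if prev_peak is not None and t.prompt_tokens and t.prompt_tokens[0] >= prev_peak:
            warnings.append(f"{t.task_id}: context did not reset at task boundary")
        if t.prompt_tokens:
            prev_peak = max(t.prompt_tokens)
    return warnings


# Iteration counts from the dashboard above; token series are invented.
healthy = [
    TaskShape("T-001", 7, 1, 0, [4000, 9000, 15000]),
    TaskShape("T-002", 12, 1, 0, [4200, 11000, 18000]),
    TaskShape("T-003", 9, 1, 0, [4100, 8000, 14000]),
    TaskShape("T-004", 13, 1, 0, [4300, 12000, 19000]),
]
print(read_shape(healthy))  # prints []
```

The thresholds are the part I would expect to tune. The point is only that each pattern is a comparison against something already in the log.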

Here is a fresh one kicking off: a path, a model, and the loop starts turning.

Animated terminal recording of a Tilth run launching: the `tilth run` command loads four tasks, prints the session id, worktree and branch, selects the deepseek-v4-flash model, and begins task T-001 at iteration 1.

Model and harness activity streams into the web view live, filling in event by event.

Animated Tilth dashboard for a running session, status "running", with the summary panels filling in live and a Communication feed streaming events (session start, tool calls, hook checks, results) as they happen, with a follow button pinned to the latest event.

And I can drop into any single iteration and read it like a transcript: the evaluator’s verdict with its reasoning, the ledger entry, the commit.

A single Tilth iteration read as a transcript: iteration 12 with its token counts, the evaluator's accept verdict and its reasoning against the acceptance criteria, the appended ledger entry, a task-done summary, and the commit hash.

None of this is novel as “observability.” There is a lot of good writing on agent tracing right now. What I’d flag is smaller and more practical. I built all of this for myself, to debug my own runs, and the live view came almost for free, because every panel is a render of the same append-only event log. Build the log for the autopsy and the live dashboard falls out of it.
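To make the "every panel is a render of the same log" idea concrete, here is a rough sketch of a panel as a fold over a JSON-lines log. The event types and field names (`iteration`, `verdict`, `session_end`, `tokens`) are hypothetical stand-ins, not Tilth's actual event schema.

```python
import json
from collections import Counter


def summarize(lines):
    """Fold an append-only JSONL event log into the numbers a summary panel shows."""
    summary = {
        "status": "running",      # stays "running" until a session_end event arrives
        "tokens": 0,
        "verdicts": Counter(),    # accept / reject counts
        "iterations": Counter(),  # iterations per task id
    }
    for line in lines:
        event = json.loads(line)
        kind = event["type"]
        if kind == "iteration":
            summary["iterations"][event["task"]] += 1
            summary["tokens"] += event.get("tokens", 0)
        elif kind == "verdict":
            summary["verdicts"][event["verdict"]] += 1
        elif kind == "session_end":
            summary["status"] = event["status"]
    return summary


# A made-up four-event log, one JSON object per line.
log = [
    '{"type": "iteration", "task": "T-001", "tokens": 1200}',
    '{"type": "iteration", "task": "T-001", "tokens": 1500}',
    '{"type": "verdict", "task": "T-001", "verdict": "accept"}',
    '{"type": "session_end", "status": "all_done"}',
]

live = summarize(log[:3])  # a prefix of the log: what the live view would show
final = summarize(log)     # the whole log: what the post-mortem would show
```

The same function serves both views. The live dashboard is the fold over whatever has been appended so far, and the autopsy is the fold over the complete file.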

Why I care about this particular instrument: I am trying to work out how far cheaper open-weights models can be trusted with autonomous work. You can’t answer that from a pass or fail. You have to watch how the model gets there, and notice when it thrashes. The flight recorder is what makes that process legible.

That is the progress. It is the instrument, not the study. The study is the next step: I want to batch-evaluate a finished session and score it on effectiveness and efficiency, so the shape I am reading by eye becomes something gradable.

The rest of the details are in the Tilth docs.
