Tilth gets a Flight Recorder
Seeing what a long-running agent actually does
A quick progress update on Tilth, the small agent harness I introduced a while back for running open-weights models against real coding tasks. Most of the recent work has gone into one thing: being able to see what a session is doing, while it runs and after it finishes.
This is in the form of a web app composed of a dashboard and a visualized stream of the activity logs for a session.
Here we see the dashboard section for a finished session:
Four tasks, start to finish, in about six and a half minutes and 285k tokens. The reason I keep coming back to this view is that it answers a question the exit code can’t. A run that finishes and a run that finishes well look identical if all you have is all_done. The difference is in the shape.
A few patterns I read off it now without thinking:
Iteration counts spread evenly and well under the cap: a healthy run. One task pinned near the ceiling is one that got stuck.
Mostly accepts, few rejects: the worker and the evaluator agree. A wall of rejects means they are talking past each other.
Context pressure that climbs through a task and resets at each boundary: the out-of-context memory doing its job. A line that never resets is context quietly bloating.
Here is a fresh one kicking off: a path, a model, and the loop starts turning.
The model and harness activity stream live in the web view, filling in event by event.
And I can drop into any single iteration and read it like a transcript: the evaluator’s verdict with its reasoning, the ledger entry, the commit.
None of this is novel as “observability.” There is a lot of good writing on agent tracing right now. What I’d flag is smaller and more practical. I built all of this for myself, to debug my own runs, and the live view came almost for free, because every panel is a render of the same append-only event log. Build the log for the autopsy and the live dashboard falls out of it.
Why I care about this particular instrument: I am trying to work out how far cheaper open-weights models can be trusted with autonomous work. You can’t answer that from a pass or fail. You have to watch how the model gets there, and notice when it thrashes. This is the thing that makes that legible.
That is the progress. It is the instrument, not the study. The study is the next step: I want to batch-evaluate a finished session and score it on effectiveness and efficiency, so the shape I am reading by eye becomes something gradable.
The rest of the details are in the Tilth docs.






