Tom Kaczocha

Seven days, 194 tasks, one daemon

· 8 min read

In March 2026 I wanted a daemon that could drive Claude Code, OpenAI Codex, and Gemini CLI from the same conversation, hold state across restarts, and work from my phone. Nothing did all of it. So I wrote a seed brief, handed it to do-work, and started feeding it follow-ups as the shape of the thing clarified.

Seven days and 47 briefs later, do-work had shipped relay: 43 source files, 12,500 lines of C, 573 hermetic tests, one 263 KB static binary. 194 discrete tasks. Test, commit, next task. The do-work/ directory for this run stays local (by design, it often contains briefs I haven't decided to make public), so the run-by-run trail isn't in the relay repo. For a public example of what a do-work run looks like, see how do-work built itself at github.com/rawphp/do-work/tree/main/do-work, 32 archived REQs and counting.

This post is about what do-work did, and what I learned watching it.

What do-work is

do-work is a Claude Code skill. It's not an agent framework, not a planner, not a swarm. It's a file-based autonomous task loop.

You write a brief in user-requests/UR-NNN/input.md. do-work reads it, decomposes it into discrete tasks, writes each one to disk as REQ-NNN.md, and executes them one at a time. Test-driven. One git commit per task. When a task is done, it archives. When the queue is empty, it stops.
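The queue convention is simple enough to sketch in a few lines of C. This is purely illustrative - do-work itself is a prompt-driven skill, not a C program, and `next_req` is a name I made up - but it shows how little machinery the file layout actually needs: scan the tasks directory, pick the lowest-numbered REQ, work it, repeat.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical picker for a do-work-style queue: find the lowest-numbered
 * REQ-NNN.md in `dir` and copy its filename into `out`. Returns 1 if a
 * task was found, 0 if the queue is empty. Not do-work's actual code. */
int next_req(const char *dir, char *out, size_t outlen) {
    DIR *d = opendir(dir);
    if (!d) return 0;
    int best = -1;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        int n;
        /* sscanf matches the literal "REQ-" prefix and ".md" suffix,
         * so stray files in the directory are ignored. */
        if (sscanf(e->d_name, "REQ-%d.md", &n) == 1 && (best < 0 || n < best)) {
            best = n;
            snprintf(out, outlen, "%s", e->d_name);
        }
    }
    closedir(d);
    return best >= 0;
}
```

When the scan finds nothing, the loop stops - which is exactly the "when the queue is empty, it stops" behavior above, expressed as a return code.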

There's no cleverness. The whole system is a few markdown conventions, a few agents, and a loop. The discipline is in the file layout, not the prompt. The code is on GitHub at github.com/rawphp/do-work.

The briefs

The seed brief for relay was one paragraph. It named the daemon, the language (C, static binary, no runtime), the transports (Telegram first, Unix-socket bus for agent-to-agent), the providers (Claude Code, OpenAI Codex, Gemini CLI), and the non-negotiables: hermetic tests, dependency injection for every syscall, one commit per task.

No architecture diagram. No module list. No file tree. do-work decomposed the brief into REQs, picked the first one, wrote the failing test, made it pass, committed, moved on. The shape of the codebase emerged from the constraints, one task at a time.

When the first brief ran dry I wrote another. 46 more followed, each dropped into do-work/user-requests/UR-NNN/input.md before do-work touched it.

The range is wider than you'd expect. UR-001 was two sentences: "Real-life command is not listed in telegram commands. When issuing the help command, there are only three commands listed; all the others are missing." A bug report, nothing more. UR-015 was 800 words of architectural thinking about skill manifests, trigger-based routing, health checks, and a first implementation step. UR-047 was a single line asking do-work to bring the docs up to date with the agent bus work it had just shipped.

Bug report. Design spec. Doc update. Same envelope, same loop, same one-commit-per-task discipline.

The pattern that emerged: I'd write a brief, let do-work work, check the archived REQs, and the next brief would write itself from what I'd just read. The briefs got sharper as the codebase taught me what to ask for next.

The seven days

194 tasks in seven days is one task every 52 minutes across 24-hour days, or roughly one every 35 minutes of waking time. I wasn't at the keyboard for most of it.

The directory tells the story better than I can. do-work/user-requests/archive/ holds every brief. do-work/decisions/ holds every architectural choice made mid-run - peer advertisement protocol, session cache format, path encoding scheme - captured the moment they were made, not after. do-work/tasks/archive/ holds the REQ files, each one a tiny postmortem: what was asked, what was done, which tests were added, which commit shipped it.

When I came back to check, the cadence was steady. Red test, green test, commit, next task. The log was boring in the way production systems are boring: nothing interesting happening because everything's working.

Bisectable by commit. Replayable by REQ. Auditable by decision.

What do-work got right

The TDD discipline survived seven days of autonomous execution. 573 tests, all hermetic. Not one docker-compose. Not one integration test masquerading as a unit test. Every external dependency - HTTP, process spawning, clock, filesystem - is behind a DI struct in relay.h with a mock in tests/mocks.h. That pattern was the brief's only hard architectural constraint and it held the whole run together.
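The DI pattern is worth sketching, because it's the whole reason the tests can be hermetic. The field names below are illustrative, not relay.h's actual layout: production code never calls a syscall directly, only through a struct of function pointers, and a test swaps in a mock that pins the clock instead of sleeping.

```c
#include <time.h>

/* Illustrative DI struct in the spirit of relay.h -- the real field
 * names and signatures in the relay source may differ. */
typedef struct {
    time_t (*now)(void);                                  /* clock */
    int (*http_get)(const char *url, char *buf, int len); /* HTTP  */
} deps_t;

/* Production logic touches the outside world only through `deps`. */
int poll_due(const deps_t *deps, time_t last_poll, int interval_s) {
    return deps->now() >= last_poll + interval_s;
}

/* A hermetic test injects a frozen clock -- no sleeps, no real time. */
static time_t mock_now(void) { return 1000; }
```

The same shape extends to process spawning and the filesystem: one struct member per external effect, one mock per member in tests/mocks.h.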

Decisions were captured in flight. When do-work had to choose between polling and inotify for the peer registry, it wrote the trade-off down before it chose, not afterwards. That folder is the only design documentation the project needs.

One commit per task made the run bisectable. do-work's own repo is the public example: 32 archived REQs map one-to-one to the commits that shipped them. When I later found a bug in the bus dead-drop code, git bisect landed on the exact REQ in two minutes. The commit message pointed at the REQ. The REQ pointed at the test. The test pointed at the line.

Boring, legible, reversible. That's what autonomy needs to be safe.

What do-work got wrong

Fuzzy briefs produced fuzzy decomposition. Early on I asked for "phone-friendly ergonomics" without defining what that meant. do-work invented an answer. The answer was reasonable. It wasn't what I wanted. I got the architecture I briefed for, not the one I would have sketched if I'd sat down with a whiteboard for an hour first.

Mid-run steering was expensive - for a while. In the early days, interrupting the loop to redirect it meant a cold restart: write a new UR, resume, wait for it to re-read the context. Cheap if you did it once. Painful if you did it five times. So we grew a verify stage. Now /do-work go scores REQ coverage against the brief before executing anything - 0 to 100%, auto-executes at 90% or above, lists the gaps below that. An audit stage runs alongside it and interrogates every REQ's acceptance criteria, auto-fixing vague spots and reporting what it changed. The effect is that most steering happens before the loop runs, not during it. The problem didn't stay a problem - it turned into a feature.
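The gate itself is trivial arithmetic. A sketch of the threshold rule as described above - the real scoring is done by the skill's prompts, not by code, and these function names are mine:

```c
/* Coverage gate sketch: score = covered requirements / total, as a
 * percentage; auto-execute at >= 90. Illustrative of the rule only --
 * do-work's verify stage is prompt-driven, not C. */
int coverage_pct(int covered, int total) {
    return total > 0 ? (covered * 100) / total : 0;
}

int should_auto_execute(int covered, int total) {
    return coverage_pct(covered, total) >= 90;
}
```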

Some tasks needed human fixing at the end. A handful of edge cases in the Telegram long-poll path passed tests but were subtly wrong against the real API. The tests were green. The tests were incomplete. That's not do-work's fault - it's the cost of hermetic testing against a mocked API surface. It's still a cost.

The honest summary: brief it sharper than you think you need to, because you'll get exactly what you ask for. verify and audit will tell you when the decomposition missed something. They won't tell you when you asked for the wrong thing.

What relay is, briefly

relay is a persistent C daemon that polls Telegram, routes messages to Claude Code (or Codex, or Gemini) as a subprocess, and ships replies back. Sessions survive restarts. Workspaces are configurable. An agent bus over Unix sockets lets multiple relay instances talk to each other and to you, with dead-drop persistence for offline agents.
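Dead-drop persistence can be sketched in a few lines. The file layout here is hypothetical - relay's actual on-disk format isn't documented in this post - but the idea is this: when a peer's socket isn't reachable, the message is appended to a per-peer file, and the file is drained and cleared the next time the peer connects.

```c
#include <stdio.h>
#include <string.h>

/* Append one message to the peer's drop file. Hypothetical layout:
 * one message per line under <dir>/<peer>. */
int drop_store(const char *dir, const char *peer, const char *msg) {
    char path[512];
    snprintf(path, sizeof path, "%s/%s", dir, peer);
    FILE *f = fopen(path, "a");
    if (!f) return -1;
    fprintf(f, "%s\n", msg);
    return fclose(f);
}

/* Drain up to `max` queued messages into `out`, oldest first, then
 * delete the drop file. Returns the number of messages delivered. */
int drop_drain(const char *dir, const char *peer,
               char out[][256], int max) {
    char path[512];
    snprintf(path, sizeof path, "%s/%s", dir, peer);
    FILE *f = fopen(path, "r");
    if (!f) return 0;
    int n = 0;
    while (n < max && fgets(out[n], 256, f)) {
        out[n][strcspn(out[n], "\n")] = '\0';  /* strip trailing newline */
        n++;
    }
    fclose(f);
    remove(path);  /* messages delivered; clear the queue */
    return n;
}
```

A real bus would fsync and handle partial writes; the sketch just shows why an offline agent can come back hours later and still get its mail.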

263 KB static binary. 43 C files. 573 tests. Four transport adapters. One config file. No runtime dependencies beyond libc and libcurl.

Architecture detail is in ARCHITECTURE.md - the document do-work maintained as it built.

Why I stopped using relay

I used it for a few weeks. It worked exactly as designed.

Telegram turned out to be a great UI. The phone-on-the-train pattern was real - fire off a brief, let relay chew on it, come back to commits. A long-lived agent with its own memory, its own bus, its own inbox, reachable from anywhere I had signal. The architecture delivered.

Then I re-read Anthropic's Consumer Terms. Section 3 prohibits "access[ing] the Services through automated or non-human means, whether through a bot, script, or otherwise" except via an API key or where Anthropic explicitly permits it. relay is a daemon invoking Claude Code as a subprocess on a subscription - a bot driving the Service by any reasonable reading. The workflow relay depends on isn't within the rules as I read them.

The API was the obvious fallback. I didn't want to use it. I was already paying A$340/month for the subscription - that's a serious investment, and stacking API spend on top to keep an experiment running wasn't something I was willing to do.

So I stopped. Not because relay was wrong. Because the bill of materials changed underneath it.

What I use now

Claude Code, terminal only. Nothing else. Opus 4.7 for most work, Sonnet for fast turns. A laptop. 14 skills cover the writing, design, and research surface - author, editor, humanize, content-engine, landing-page-designer, researcher, skill-optimizer, and the rest. 25+ agent systems handle the bigger stuff - saas-cofounder, goal-system, agent-builder, google-ads-strategist, social-marketing-manager, and on. do-work is the spine for anything larger than a single commit.

The stack got smaller, not bigger. That's usually a good sign.

Closer

relay did what I asked of it. Seven days to build, a few weeks in daily use, then a policy change I didn't control took it off the table. The code still runs. The architecture still holds. I just don't run it.

What I'm left with is the clearest case study I have for do-work. 47 briefs, 194 tasks, 573 tests, one commit per task, all traceable end to end. A working daemon I'd happily run again if the rules shifted back (github.com/rawphp/relay). And the loop that built it, still sitting in ~/.claude/skills/do-work/, ready for the next brief.

do-work is on GitHub at github.com/rawphp/do-work. MIT-licensed. Point it at a brief, get out of its way.

© 2026 Tom Kaczocha. All rights reserved.