The system that built this site

What the system is

The short version: Claude Code running as an orchestrator that delegates work to specialist AI agents, each in its own isolated git worktree so nothing tramples anything else. Around them sit enforcement hooks — rules that block bad actions mechanically, not politely. A force-push, a write outside the project, a hardcoded credential: blocked, not warned about. The whole configuration is the same one I run my own operation on, and I put more than a hundred million tokens a week through it on real work. This site is some of that work, which means you’re reading the output of the thing being described. Treat the rest of this essay as a worked example.
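To make "blocked, not warned about" concrete, here is a minimal sketch of the kind of check such a hook might run. It is an illustration under stated assumptions, not the hooks this site actually uses: the function name `check_action`, the `tool` and `payload` shapes, and both regular expressions are hypothetical, and how the check is wired into Claude Code is left out.

```python
import re
from pathlib import Path
from typing import Optional

# Hypothetical patterns, chosen for illustration only.
# Matches `git push` with -f or any --force variant (including --force-with-lease).
FORCE_PUSH = re.compile(r"\bgit\s+push\b.*\s(--force|-f)\b")
# Matches assignments such as API_KEY = "..." with a quoted value of 8+ characters.
CREDENTIAL = re.compile(
    r"(?i)(api[_-]?key|secret|password|token)\s*[:=]\s*[\"'][^\"']{8,}[\"']"
)


def check_action(tool: str, payload: dict, project_root: str) -> Optional[str]:
    """Return a reason to block the action, or None to allow it."""
    if tool == "shell" and FORCE_PUSH.search(payload.get("command", "")):
        return "force-push blocked"
    if tool == "write":
        root = Path(project_root).resolve()
        # Resolve ".." segments so a relative path cannot escape the project.
        target = (root / payload.get("path", "")).resolve()
        if not target.is_relative_to(root):  # requires Python 3.9+
            return "write outside project blocked"
        if CREDENTIAL.search(payload.get("content", "")):
            return "hardcoded credential blocked"
    return None
```

The point of the shape is that the check returns a verdict the caller must obey. There is no branch where the agent is asked nicely.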

Eleven reviewers, then the firing squad

Before launch I put eleven reviewers on the site: a per-surface auditor for every page, a CRO, a deliberately sceptical ops-director persona, a launch PM. Each produced findings. Then every finding went to a second set of agents with one instruction: refute this against the actual code. Eighty-six agents in total, arguing about my website.

Roughly a quarter of the “critical” findings died in verification. That number is the point, not an embarrassment. One model’s confident opinion is worth little — a review is just more junior work, and it needs checking like everything else the junior produces. A finding that survives a serious attempt to kill it is worth acting on. A finding that doesn’t was about to cost me a day.
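The review-then-refute step can be sketched in a few lines. This is a minimal illustration, assuming each finding is a small record and the refutation agent can be treated as a function that returns True when it manages to kill a finding. `Finding`, `verify`, and the `refute` callable are hypothetical names, not the real pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Finding:
    reviewer: str   # which reviewer raised it
    severity: str   # e.g. "critical"
    claim: str      # what the reviewer says is wrong


def verify(
    findings: List[Finding],
    refute: Callable[[Finding], bool],
) -> Tuple[List[Finding], float]:
    """Keep the findings that survive refutation and report the share that died."""
    survivors = [f for f in findings if not refute(f)]
    killed = len(findings) - len(survivors)
    kill_rate = killed / len(findings) if findings else 0.0
    return survivors, kill_rate
```

With four findings and a refuter that kills one of them, `verify` returns three survivors and a kill rate of 0.25. Only the survivors go on the to-do list.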

What it catches at 2am

The design round ran the same shape with seven more perspectives: design-system drift, brand, three different buyer personas, funnel coherence. Thirty-three agents this time. What they caught is exactly the kind of thing humans miss at two in the morning. Two page shells had drifted three pixels apart on their underline offsets. A running-cost figure said $0.10 a month on one page and $1 a month one click away. When I chased down the honest number, it turned out to be roughly $1.50, so all three pages now say that. Not the prettiest of the three numbers. The true one.
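A contradiction like the running-cost one is mechanical enough to check in code. The sketch below is one way to do it, assuming page copy is available as plain text and that costs are phrased as "$N a month". The function name `cost_conflicts` and the page names in any example are hypothetical.

```python
import re
from collections import defaultdict
from decimal import Decimal
from typing import Dict, List

# Assumes costs are written like "$0.10 a month"; other phrasings are not caught.
MONTHLY_COST = re.compile(r"\$(\d+(?:\.\d+)?)\s+a\s+month")


def cost_conflicts(pages: Dict[str, str]) -> Dict[Decimal, List[str]]:
    """Map each quoted monthly cost to the pages quoting it.

    Returns an empty dict when every page agrees (or none quotes a cost).
    """
    seen: Dict[Decimal, List[str]] = defaultdict(list)
    for name, text in pages.items():
        for amount in MONTHLY_COST.findall(text):
            # Decimal makes "$1.5" and "$1.50" compare as the same figure.
            seen[Decimal(amount)].append(name)
    return dict(seen) if len(seen) > 1 else {}
```

A check like this only finds the disagreement. Deciding which number is true still takes someone going back to the bills.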

What I actually did

The build that launched this site was six waves of implementation agents working in parallel, each wave gated by a clean build and a smoke test before the next started, and three production deploys to finish. Those were the launch numbers; they didn’t stay still. The machine kept running after launch, and the log at the end of this essay is where the totals stand now. My job through all of it: set direction, pick between the options the panels surfaced, veto, approve. When a buyer persona flagged that the hero example was pulling the wrong audience, swapping it was my call. The reply-time promise on the homepage is mine. The blue ‘oh’ in the wordmark stays because I like it, and the reviewers have been told to stop flagging it.
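The wave structure is simple to express: run a wave's agents in parallel, then refuse to start the next wave until every gate passes. A minimal sketch follows, assuming each agent's work can be modelled as a plain callable and each gate (build, smoke test) as a function returning True or False. `run_waves` and its parameters are hypothetical names for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

Task = Callable[[], None]
Gate = Callable[[], bool]


def run_waves(waves: Sequence[Sequence[Task]], gates: Sequence[Gate]) -> int:
    """Run waves in order and return how many passed every gate.

    Tasks inside a wave run in parallel. The first wave that fails a gate
    stops the run, so later waves never start.
    """
    completed = 0
    for wave in waves:
        with ThreadPoolExecutor() as pool:
            # list() waits for every task and re-raises any exception.
            list(pool.map(lambda task: task(), wave))
        if not all(gate() for gate in gates):
            break
        completed += 1
    return completed
```

Called with six waves and two gates, it returns 6 only if the build and the smoke test pass after every wave. Anything less means a wave broke something and the run stopped there.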

Which is the honest punchline: the AI didn’t decide anything that mattered. It argued, verified, built, and deployed — at a volume and pace no human team would ever point at one consultant’s website. Knowing which hour to go after, and which finding to ignore, was the job. The system itself is part of the proof I point people at, because a builder’s own tools are the least fakeable evidence there is.

When the verdict beat the first answer

This pattern earned its keep before the site existed. A while back I ran a five-round agent debate on a checkout page, and the final verdict reversed the model’s own opening recommendation: treat the buy button as a price anchor, not a conversion driver. Asked once, the model gave the conventional answer. Pushed through five rounds of structured argument, it talked itself out of it — and the verdict was the better answer. The value was never in the first opinion. It’s in the machinery that makes opinions fight.

Where the human earned the seat

Updated 12 June 2026: the essay originally ended at launch. The machine kept running, so this section and the log at the end were added.

The panels and I have disagreed since launch, and the record shows who won what. The best example is one homepage-ordering question that got its own four-persona panel. That panel, and a review round before it, both recommended keeping the small cloneable examples at the top of the proof block: easy to verify, low friction, let a visitor try something inside half an hour. I overrode them on value-weighted grounds — you don’t risk the big buyer to teach the small one. Then I proposed the synthesis myself: one large proof, the build of this site, with the small examples kept underneath as verification links. My version shipped. It reads better than either of the versions the panels were arguing over, and the panels couldn’t have produced it, because the weighting came from knowing what the business is for.

The audits cut the other way too. The readiness round that ran before any advertising money moved caught the site claiming that a piece of its own machinery was open to public inspection. It wasn’t. A checkable claim that fails the check is worse than no claim at all, so it came off the site before a single ad ran.

And once, the system enforced honesty on me. Site copy here may only state facts that sit in an approved-claims bank, and a time-saved figure that was plausible, flattering, and not on the list crept onto a page. The gate struck it. The number was then checked properly, approved, and restored with the right attribution. Same words on the page, different standing underneath them: the second time, somebody had stood behind it.
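A claims gate of this kind can be sketched briefly. The version below is a deliberately crude stand-in, assuming that any sentence containing a digit counts as a factual claim and that the bank is a set of approved sentences compared after normalising case and whitespace. `unapproved_claims` and `normalise` are hypothetical names, and the real gate may work quite differently.

```python
import re
from typing import Iterable, List, Set

# Assumption: a sentence containing any digit is treated as a checkable claim.
FIGURE = re.compile(r"\d")


def normalise(sentence: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(sentence.lower().split())


def unapproved_claims(copy: Iterable[str], approved: Set[str]) -> List[str]:
    """Return the sentences that state a figure not present in the approved bank."""
    bank = {normalise(s) for s in approved}
    return [s for s in copy if FIGURE.search(s) and normalise(s) not in bank]
```

Run against a page, it returns the sentences to strike. Add the same sentence to the bank after it has been checked and the list comes back empty, with the words on the page unchanged.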

That’s the division of labour this essay is really about. The machine argues, verifies, builds, and keeps the records straight, at a volume and pace no human team could match. The judgment calls stayed human: which finding to act on, which panel to override, which claim is allowed on the page. So did the final word. The site is better for both halves, and neither half would have been enough alone.

Why this matters if you’re buying

Because this is what I mean by applied AI: the judgment stays human, the production is delegated, and every claim gets attacked before it ships. It’s the same shape I install for clients, sized to the stakes. The review layers get heavier as the cost of a mistake goes up, lighter where a miss is cheap. A marketing site can live with a three-pixel drift for a day. An invoice can’t go out wrong even once.

And it answers the question you should be asking any consultant: how do they ship their own things? This is how I ship mine. Adversarial review before launch, honest numbers even when a prettier one was already live, a gate before every deploy. My own site got that treatment, and it’s the floor, not the ceiling — the same rigour gets heavier, not lighter, when the work carries someone else’s name.

The log, so far

For the record, the running totals as of this update. The launch numbers earlier in this essay were a snapshot; since the ninth of June the machine has run sixteen-plus working waves and landed 23 commits, across roughly seventeen production deploys. Total agent runs passed 300 somewhere along the way. Three adversarial review rounds have run so far, the launch review at 86 agents the largest of them. The design panel added 33 more, the pre-advertising audit another 14. Across all three rounds, roughly one critical finding in four died under verification: the machinery keeps killing its own bad ideas before they cost anything. These totals will be stale by the time you read them, because this is how the site runs now, permanently.