57 Pull Requests in 6 Days

· 6 min read · #clawshell #ai #multiplexing #building-in-public #refactoring


Six weeks ago I wrote about running five AI agents in parallel and realizing I was the slowest part of the system. I was reviewing every diff, approving every merge, context-switching across five Telegram groups. The throughput was impressive but the bottleneck was obvious: me.

So I fired myself from the loop.

What changed

I promoted Daneel from “dev agent” to dev lead. Not metaphorically — structurally. The dev lead session now has a cron job that wakes it every 5 minutes to check on all five dev agents. When a PR is ready, the lead reviews it. Not me. The lead reads the diff, checks the tests, iterates with the dev agent if something’s off, and loops until the PR is clean. Then the lead merges into main, resolves any rebase conflicts with the other in-flight branches, restarts the Expo dev server, and updates the tracking gist.
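The heartbeat-and-notify loop can be sketched like this (a TypeScript illustration; the agent statuses and names like `batchComplete` are mine, not the actual internals):

```typescript
// Illustrative sketch of the dev lead's wake-up cycle. Statuses and
// function names are hypothetical, not the real orchestration code.
type AgentStatus = "working" | "pr-ready" | "merged" | "blocked";

interface Agent {
  name: string;
  status: AgentStatus;
}

// The batch is done when every in-flight PR has landed on main.
function batchComplete(agents: Agent[]): boolean {
  return agents.every((a) => a.status === "merged");
}

async function heartbeat(agents: Agent[]): Promise<void> {
  for (const agent of agents) {
    if (agent.status === "pr-ready") {
      // Review the diff, check tests, iterate with the dev agent,
      // merge into main, resolve rebases, restart the dev server.
    }
  }
  if (batchComplete(agents)) {
    // Notify the human: "all 5 merged, server running, ready to test."
  }
}

// The cron job described above: wake the lead every 5 minutes.
// setInterval(() => heartbeat(currentAgents), 5 * 60 * 1000);
```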

I get a notification when all five PRs from a batch are merged and the dev server is running on main. That’s when I pick up my phone and test.

My job is now: use the app, report what feels wrong, throw the next batch of five tasks over the wall. That’s it. I don’t read code. I don’t approve merges. I test the product.

The orchestration

When I drop a chunk of five tasks, the lead doesn’t just blindly dispatch them to five agents. It checks dependencies first. “Task 3 touches the same file as task 1 — schedule 3 after 1 merges.” “Tasks 2 and 5 are independent — run them in parallel.” The goal is to minimize rebase conflicts, because with five agents pushing branches simultaneously, conflicts are the #1 time sink.
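The dependency check amounts to grouping tasks into parallel waves so that no two tasks in the same wave touch the same file. A minimal sketch, with a hypothetical task shape:

```typescript
// Hypothetical scheduler: tasks with overlapping files are serialized
// into later waves; disjoint tasks share a wave and run in parallel.
interface Task {
  id: number;
  files: string[];
}

function scheduleWaves(tasks: Task[]): Task[][] {
  const waves: Task[][] = [];
  for (const task of tasks) {
    // Find the first wave whose tasks share no files with this one.
    const wave = waves.find((w) =>
      w.every((t) => !t.files.some((f) => task.files.includes(f)))
    );
    if (wave) wave.push(task);
    else waves.push([task]); // conflict everywhere: start a new wave
  }
  return waves;
}
```

Tasks 1 and 3 both touching `App.tsx` land in different waves; independent tasks run together, which keeps rebase conflicts to a minimum.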

The lead learned this the hard way. Dev 5 once pushed a rebase that would have silently reverted four other PRs. The lead caught it in the diff. That’s when I decided the lead reviews everything — no direct merges from dev agents, ever.

The sprint

March 21–26. I had a GitHub gist with every TODO item — P0 through P2, grouped by theme. I started feeding it in chunks of five.

| Day | PRs merged | Highlights |
| --- | --- | --- |
| Mar 21 | 4 | Image sending, TTS gap elimination |
| Mar 22 | 7 | Image thumbnails, WS compression, iOS volume bug |
| Mar 23 | 3 | Conversation search, haptics, background audio |
| Mar 24 | 26 | The Big One™ |
| Mar 25 | 7 | Error retry, dead code cleanup, config consolidation |
| Mar 26 | 10 | Push notifications, swipe navigation, custom instructions |

57 PRs. 134 total since the repo was created seven weeks ago.

March 24: 26 PRs

The big refactoring day. App.tsx had grown to 3,385 lines — everything lived in one file. Three PRs split it: 6 hooks extracted, 6 components extracted, 33 setStatus() calls replaced with a typed finite state machine. Down to 840 lines.
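A typed finite state machine in that spirit might look like the following; the state names and transition table here are illustrative, not the app's actual set:

```typescript
// Sketch of a typed FSM replacing ad-hoc setStatus() calls.
// States and transitions are hypothetical examples.
type Status =
  | "idle"
  | "recording"
  | "transcribing"
  | "responding"
  | "speaking"
  | "error";

// Only these transitions are legal; anything else is a bug.
const transitions: Record<Status, Status[]> = {
  idle: ["recording", "error"],
  recording: ["transcribing", "idle", "error"],
  transcribing: ["responding", "error"],
  responding: ["speaking", "idle", "error"],
  speaking: ["idle", "recording", "error"], // barge-in: speaking -> recording
  error: ["idle"],
};

function transition(from: Status, to: Status): Status {
  if (!transitions[from].includes(to)) {
    throw new Error(`Illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```

The payoff over scattered `setStatus()` calls: the compiler rejects unknown states, and illegal transitions fail loudly instead of leaving the UI in a half-state.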

Same day: secure storage migration (API keys out of plaintext AsyncStorage), FlatList virtualization, debounced markdown rendering, streaming display with blinking cursor, voice speed control. Plus 172 new tests — from 382 to 554 in a single day.
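A trailing debounce like the one behind the markdown rendering fits in a few lines (a generic utility sketch, not the app's actual hook):

```typescript
// Trailing debounce: collapse a burst of calls (e.g. streaming tokens
// arriving every few ms) into one render after `ms` of quiet.
function debounce<T extends unknown[]>(
  fn: (...args: T) => void,
  ms: number
): (...args: T) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: T) => {
    if (timer) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), ms);
  };
}

// e.g. re-parse markdown at most once per 100 ms while text streams in:
// const renderMarkdown = debounce(parseAndRender, 100);
```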

I didn’t write any of that. I was testing the app on my phone, reporting regressions between batches. “Scroll jumps when response finishes.” “Haptic on record stop feels too strong.” “TTS drops between chunks 4 and 5.” Two sentences each. The lead would triage and dispatch.

Builds kept breaking

At this pace, main broke. More than once. An agent would merge something that passed its own tests but broke an unrelated feature. I’d install the new build, see a white screen, and lose 20 minutes.

So I made the lead add quality gates. Husky pre-commit hooks. ESLint. A GitHub Actions CI workflow that runs the full test suite before merge. PR #117 — the least glamorous and most important PR of the sprint.

Builds still break sometimes. But less. And when they do, the CI catches it before I waste time installing a broken build on my phone.

The numbers

| Metric | Feb 14 | Mar 26 |
| --- | --- | --- |
| Total PRs merged | ~20 | 134 |
| App.tsx | 3,385 lines | 1,155 lines |
| Tests | ~120 | 973 |
| Test files | ~15 | 86 |
| Source lines | ~8,000 | 21,261 |
| Features | Voice + basic chat | Voice, images, docs, search, haptics, push notifications, streaming STT, markdown, conversation export, swipe nav, barge-in, custom instructions |

What actually works

Batches of five. Not three (too slow), not ten (too many conflicts). Five tasks per round is the sweet spot — enough parallelism, manageable conflicts.

The 5-minute heartbeat. The cron job that wakes the lead every 5 minutes is what makes the whole thing async. I throw tasks, go do something else, come back to a notification that says “all 5 merged, server running, ready to test.” That decoupling is everything.

Testing, not reviewing. I found more bugs by using the app for 5 minutes than I ever found reading diffs. The iOS audio routing bug — TTS playing through the earpiece at whisper volume because allowsRecordingIOS: true messes with the audio session — I found that on my bike, not in a code review.
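The fix comes down to toggling `allowsRecordingIOS` per phase, since leaving it on routes playback through the quiet earpiece. A sketch, assuming expo-av's `Audio.setAudioModeAsync` option shape:

```typescript
// Sketch of the audio-session fix. Option shape mirrors expo-av's
// Audio.setAudioModeAsync; the real app's full option set may differ.
interface AudioModeOptions {
  allowsRecordingIOS: boolean;
  playsInSilentModeIOS: boolean;
}

// allowsRecordingIOS: true switches iOS to the earpiece route at
// whisper volume, so it must be on only while actually recording.
function audioModeFor(phase: "recording" | "playback"): AudioModeOptions {
  return {
    allowsRecordingIOS: phase === "recording",
    playsInSilentModeIOS: true,
  };
}

// Applied before each phase in the app, e.g.:
// await Audio.setAudioModeAsync(audioModeFor("playback"));
```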

Letting the lead iterate. When a PR isn’t good enough, the lead sends it back to the dev agent with specific feedback. They go back and forth until it’s right. I’m not in that loop. I don’t need to be.

What doesn’t

I still lose my temper. “Stop asking stupid questions.” “Keep this doc up to date, damn it.” “UX P0 are not done wtf.” The agents are patient. I’m not always.

Context limits. Dev agents hit their context window mid-task and just stop. The lead learned to break big refactors into smaller steps, but it still happens.

Bookkeeping. The tracking gist needs updating after every merge. The lead forgets sometimes. At this pace, losing track of what’s done vs. in-flight is the real risk.

The realization

Six weeks ago, the insight was “I’m the bottleneck.” This week’s insight is different: the bottleneck was never speed — it was role confusion.

I was trying to be the architect, the reviewer, the merger, and the tester. Now I’m just the tester and the architect. The AI handles the middle. It reviews code better than I do at midnight anyway.

I still make every product decision. “Is barge-in a tap or a press?” “Should auto-scroll stop when you scroll up?” Those calls take 5 seconds each, but there are dozens of them per day, and no agent will guess them. That’s the job now: make the calls, test the result, throw the next batch.

57 PRs in 6 days. I didn’t read most of them. The app has never been better.

— P