230 Checks, 0 Critical: How We Audit a 5,000-Line Feature with AI

The Feature

sh0's Mail MVP turns a single CLI command into a full-fledged email hosting platform. Stalwart Mail Server, DKIM key generation, DNS verification, Cloudflare auto-configuration, mailbox and alias management -- all wrapped in a 4-tab dashboard with 5-language i18n support.

The implementation spans ~5,000 lines across 30 files: a SQL migration, 3 Rust models, DKIM crypto, a Docker container manager, a Stalwart REST API client, DNS verification via dig, Cloudflare DNS extensions, 15 API handlers with RBAC, and a Svelte 5 dashboard with a setup wizard, detail page, and full CRUD modals.

The Problem with AI-Generated Code

Each AI session optimizes locally. Session 1 builds the database layer. Session 2 builds the API. Session 3 builds the dashboard. Each session produces working code -- but does the dashboard's TypeScript interface match the Rust response struct? Does the DNS record format string in mail_crypto.rs match what the setup wizard displays? Does the DnsStatus enum serialize as "pass" or "Pass"?

Cross-layer consistency is where bugs hide. The individual pieces work fine in isolation; the integration is where things break.

The Methodology

We use a build-audit-audit-approve pipeline:

Build sessions (3 sessions): Each focused on one layer. Infrastructure, API, Dashboard.
Focused audits (2 rounds): Each audit session reviews one build session. Session 2's audit found 4 Important issues (dig timeouts, Docker orphan cleanup, partial Cloudflare failure logging, optional alias addresses). Session 3's audit found 3 Important issues (hardcoded English strings, wrong empty-state i18n keys, untranslated status text).
Global audit (this session): A fresh context reads every file and checks the system as a whole.

What the Global Audit Found

We defined 230 checklist items across 19 sections:

Schema correctness: 11 checks (column types, constraints, indexes, foreign keys)
Model layer: 13 checks (from_row mappings, CRUD methods, serde annotations)
Crypto: 10 checks (DKIM keygen, DNS format strings, error handling)
Docker: 14 checks (image, ports, volumes, labels, idempotency, cleanup)
Stalwart client: 11 checks (auth, endpoints, error handling, timeouts)
DNS verification: 12 checks (6 record types, injection prevention, timeouts)
Cloudflare: 9 checks (MX/TXT creation, partial failure handling, TTL)
API handlers: 30 checks (15 endpoints, RBAC, encryption, validation, audit logging)
Router & OpenAPI: 6 checks
Request/Response types: 8 checks
TypeScript types & API client: 9 checks (field-for-field matching)
Dashboard pages: 51 checks (list, wizard, detail with 4 tabs)
i18n & French accents: 14 checks (accents are critical -- this is an educational platform)
Security: 12 checks (encryption, injection, XSS, secrets in responses)
Cross-layer consistency: 9 checks (DB == Model == API == TypeScript == Dashboard)
Previous fix verification: 7 checks (all 7 fixes from prior audits still in place)
Build verification: 4 checks

Results: 227 pass, 3 fail. 0 Critical. 2 Important (both fixed in-session).

The two Important findings were both i18n issues: hardcoded English strings that bypassed the translation system, and a missing created_at column in the mailbox table. Both fixed by adding 20 new i18n keys across 5 language files and updating 3 Svelte components.

Why This Works

The key insight is diverse perspectives at the right granularity:

Build sessions optimize for correctness within their layer
Focused audits catch bugs within that layer from a fresh perspective
The global audit catches cross-layer inconsistencies that no single-layer audit can see

The global audit found issues that both focused audits missed -- hardcoded English in error fallbacks, the missing table column. These are the kind of bugs that slip through when each reviewer only sees their own slice.

The Numbers

Metric	Value
Total lines of code	~5,000
Files touched	30
Build sessions	3
Audit rounds	3 (2 focused + 1 global)
Total findings across all audits	10
Critical findings	0
Important findings fixed	9
Minor findings	4
Checklist items verified	230
Languages with correct accents	5

The cost of the audit pipeline is ~30 minutes of AI compute. The cost of shipping a mail hosting feature with a DKIM key leak, a command injection in dig, or "Boites aux lettres" without the circumflex to an educational platform? Much higher.

230 Checks, 0 Critical: How We Audit a 5,000-Line Feature with AI

The Feature

The Problem with AI-Generated Code

The Methodology

What the Global Audit Found

Why This Works

The Numbers

Responses

Related Articles

senndo, Day Zero: The Full Fable 5 Harness, Wired Before the First Line of Code

Don't Make the Founder Open Chrome

The Agents That Arrived After The Commit