The Feature
sh0's Mail MVP turns a single CLI command into a full-fledged email hosting platform. Stalwart Mail Server, DKIM key generation, DNS verification, Cloudflare auto-configuration, mailbox and alias management -- all wrapped in a 4-tab dashboard with 5-language i18n support.
The implementation spans ~5,000 lines across 30 files: a SQL migration, 3 Rust models, DKIM crypto, a Docker container manager, a Stalwart REST API client, DNS verification via dig, Cloudflare DNS extensions, 15 API handlers with RBAC, and a Svelte 5 dashboard with a setup wizard, detail page, and full CRUD modals.
The Problem with AI-Generated Code
Each AI session optimizes locally. Session 1 builds the database layer. Session 2 builds the API. Session 3 builds the dashboard. Each session produces working code -- but does the dashboard's TypeScript interface match the Rust response struct? Does the DNS record format string in mail_crypto.rs match what the setup wizard displays? Does the DnsStatus enum serialize as "pass" or "Pass"?
Cross-layer consistency is where bugs hide. The individual pieces work fine in isolation; the integration is where things break.
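One concrete way to surface exactly this class of bug is a runtime guard at the API boundary. The sketch below is illustrative, not the project's actual code: it assumes the Rust side serializes a `DnsStatus` enum to lowercase strings (e.g. via serde's `rename_all = "lowercase"`), and shows how a TypeScript guard catches a casing mismatch instead of letting it surface as a silent UI bug.

```typescript
// Hypothetical union mirroring the Rust DnsStatus enum. If serde emits
// lowercase strings, the TypeScript side must use exactly those strings.
type DnsStatus = "pass" | "fail" | "pending";

// Runtime guard: rejects anything outside the expected wire values,
// including the casing mismatch "Pass".
function isDnsStatus(value: unknown): value is DnsStatus {
  return value === "pass" || value === "fail" || value === "pending";
}

// A raw API response is `unknown` until validated.
const fromApi: unknown = JSON.parse('"pass"');
console.log(isDnsStatus(fromApi)); // true
console.log(isDnsStatus("Pass")); // false: casing mismatch caught early
```

A guard like this turns a cross-layer drift into a loud, local failure rather than a blank status badge three layers away.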
The Methodology
We use a build-audit-audit-approve pipeline:
- Build sessions (3 sessions): Each focused on one layer. Infrastructure, API, Dashboard.
- Focused audits (2 rounds): Each audit session reviews one build session. Session 2's audit found 4 Important issues (dig timeouts, Docker orphan cleanup, partial Cloudflare failure logging, optional alias addresses). Session 3's audit found 3 Important issues (hardcoded English strings, wrong empty-state i18n keys, untranslated status text).
- Global audit (this session): A fresh context reads every file and checks the system as a whole.
What the Global Audit Found
We defined 230 checklist items across 17 sections:
- Schema correctness: 11 checks (column types, constraints, indexes, foreign keys)
- Model layer: 13 checks (from_row mappings, CRUD methods, serde annotations)
- Crypto: 10 checks (DKIM keygen, DNS format strings, error handling)
- Docker: 14 checks (image, ports, volumes, labels, idempotency, cleanup)
- Stalwart client: 11 checks (auth, endpoints, error handling, timeouts)
- DNS verification: 12 checks (6 record types, injection prevention, timeouts)
- Cloudflare: 9 checks (MX/TXT creation, partial failure handling, TTL)
- API handlers: 30 checks (15 endpoints, RBAC, encryption, validation, audit logging)
- Router & OpenAPI: 6 checks
- Request/Response types: 8 checks
- TypeScript types & API client: 9 checks (field-for-field matching)
- Dashboard pages: 51 checks (list, wizard, detail with 4 tabs)
- i18n & French accents: 14 checks (accents are critical -- this is an educational platform)
- Security: 12 checks (encryption, injection, XSS, secrets in responses)
- Cross-layer consistency: 9 checks (DB == Model == API == TypeScript == Dashboard)
- Previous fix verification: 7 checks (all 7 fixes from prior audits still in place)
- Build verification: 4 checks
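To make the "injection prevention" checks in the DNS-verification section concrete, here is a minimal sketch of the kind of validation they look for: confirming a domain is syntactically safe before it is ever interpolated into a `dig` invocation. The function name and rules are illustrative, not the project's actual implementation.

```typescript
// Hypothetical validator: accept only RFC 1035-style hostnames
// (dot-separated labels of letters, digits, and inner hyphens),
// which structurally excludes shell metacharacters like ";" or "$".
function isSafeDomain(domain: string): boolean {
  if (domain.length === 0 || domain.length > 253) return false;
  return domain
    .split(".")
    .every((label) =>
      /^[a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?$/.test(label)
    );
}

console.log(isSafeDomain("example.com")); // true
console.log(isSafeDomain("example.com; rm -rf /")); // false
```

An allowlist like this is stricter than escaping: anything that is not a plain hostname never reaches the subprocess at all.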
Results: 227 pass, 3 fail -- 0 Critical, 2 Important (both fixed in-session), 1 Minor.
The two Important findings: hardcoded English strings that bypassed the translation system, and a missing created_at column in the mailbox table. Fixing them meant adding 20 new i18n keys across 5 language files and updating 3 Svelte components.
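The hardcoded-string fix follows a standard pattern; here is a minimal TypeScript sketch of it (the keys, messages, and `t` helper are illustrative, not the project's actual i18n API). Falling back to the key itself, rather than to English text, makes a missing translation visible instead of silently English.

```typescript
// Hypothetical message catalogs for two of the five locales.
const messages: Record<string, Record<string, string>> = {
  en: { "mail.mailboxes.empty": "No mailboxes yet" },
  fr: { "mail.mailboxes.empty": "Aucune boîte aux lettres" },
};

// Lookup with the key as the fallback: a missing locale or key shows up
// as "mail.mailboxes.empty" in the UI, which is easy to spot in review.
function t(locale: string, key: string): string {
  return messages[locale]?.[key] ?? key;
}

console.log(t("fr", "mail.mailboxes.empty")); // "Aucune boîte aux lettres"
console.log(t("de", "mail.mailboxes.empty")); // "mail.mailboxes.empty"
```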
Why This Works
The key insight is diverse perspectives at the right granularity:
- Build sessions optimize for correctness within their layer
- Focused audits catch bugs within that layer from a fresh perspective
- The global audit catches cross-layer inconsistencies that no single-layer audit can see
The global audit found issues that both focused audits missed -- hardcoded English in error fallbacks, the missing table column. These are the kinds of bugs that slip through when each reviewer sees only their own slice.
The Numbers
| Metric | Value |
|---|---|
| Total lines of code | ~5,000 |
| Files touched | 30 |
| Build sessions | 3 |
| Audit rounds | 3 (2 focused + 1 global) |
| Total findings across all audits | 13 (9 Important + 4 Minor) |
| Critical findings | 0 |
| Important findings fixed | 9 |
| Minor findings | 4 |
| Checklist items verified | 230 |
| Languages with correct accents | 5 |
The cost of the audit pipeline is ~30 minutes of AI compute. The cost of shipping a mail hosting feature with a DKIM key leak, a command injection in dig, or "Boites aux lettres" without the circumflex to an educational platform? Much higher.