Every PaaS needs a reverse proxy. It is the front door -- the component that accepts HTTP traffic from the internet, terminates TLS, and routes requests to the correct container. Get it wrong and nothing works. Get it right and users never think about it. We needed to get it right in a single afternoon.
This is the story of how we turned Caddy from a standalone web server into a fully programmatic reverse proxy managed entirely through Rust, and how a cascade of five reliability fixes turned a fragile integration into a production-grade routing layer.
---
## Why Caddy Over Nginx or Traefik
The decision took about ten minutes. We needed three things from a reverse proxy: automatic HTTPS via ACME, a runtime configuration API, and minimal operational overhead for a self-hosted PaaS.
Nginx was immediately out. It requires config file generation and a reload signal for every route change. That means templating, file writes, and nginx -s reload calls -- plus the risk of generating an invalid config that takes down all routing at once. For a platform that adds and removes routes on every deploy, this is brittle.
Traefik was a serious contender. It has a robust API and native Docker integration. But its configuration model is complex -- labels, middlewares, entrypoints, routers, services -- and the Docker provider wants to own the container lifecycle. We were already managing containers ourselves through sh0-docker. Layering Traefik's Docker provider on top would have created two systems fighting over the same containers.
Caddy hit the sweet spot. Its Admin API accepts a full JSON configuration via POST /load, it handles ACME certificate provisioning out of the box, and it runs as a single static binary. No config files, no templating, no reload signals. Just HTTP calls.
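To make "just HTTP calls" concrete, here is the shape of the JSON that gets POSTed to /load -- a single server with one reverse-proxy route (the server name, hostname, and upstream address below are made up for illustration):

```json
{
  "apps": {
    "http": {
      "servers": {
        "sh0": {
          "listen": [":80", ":443"],
          "routes": [
            {
              "match": [{ "host": ["app.example.com"] }],
              "handle": [
                {
                  "handler": "reverse_proxy",
                  "upstreams": [{ "dial": "172.18.0.5:3000" }]
                }
              ]
            }
          ]
        }
      }
    }
  }
}
```

By default, Caddy provisions ACME certificates automatically for hostnames that appear in route matchers, so HTTPS comes along for free.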
The architecture is clean:
```
Internet --> Caddy (:80/:443) --> Docker containers (172.18.0.x:port)
               ^
               |
    sh0-proxy manages via Admin API (localhost:2019)
```

---

## Managing Caddy as a Child Process
Caddy runs as a child process of the sh0 server. This is a deliberate choice -- we want full lifecycle control without depending on systemd or any external process manager.
The CaddyProcess struct in process.rs handles spawning, stopping, and health checking:
```rust
pub struct CaddyProcess {
    child: Option<tokio::process::Child>,
    caddy_path: PathBuf,
}

impl CaddyProcess {
    pub async fn start(&mut self) -> Result<()> {
        // Kill any stale Caddy from a previous run
        kill_stale_caddy().await;

        let mut child = Command::new(&self.caddy_path)
            .args(["run", "--config", "-"])
            .stdin(Stdio::null())
            .stdout(Stdio::null())
            .stderr(Stdio::piped())
            .spawn()?;

        // Drain stderr in a background task (critical -- see Article 7)
        if let Some(stderr) = child.stderr.take() {
            tokio::spawn(async move {
                let reader = BufReader::new(stderr);
                let mut lines = reader.lines();
                while let Ok(Some(line)) = lines.next_line().await {
                    tracing::debug!(target: "caddy", "{}", line);
                }
            });
        }

        self.child = Some(child);
        Ok(())
    }

    pub async fn stop(&mut self) -> Result<()> {
        if let Some(child) = self.child.as_mut() {
            // Graceful SIGTERM first
            if let Some(pid) = child.id() {
                unsafe { libc::kill(pid as i32, libc::SIGTERM) };
            }
            // Wait up to 5s, then SIGKILL
            if tokio::time::timeout(Duration::from_secs(5), child.wait()).await.is_err() {
                child.kill().await.ok();
            }
        }
        self.child = None;
        Ok(())
    }
}
```
The ensure_running() method is the heartbeat. It checks whether the child process is still alive and whether the Admin API responds to pings. If either check fails, it restarts Caddy and returns a boolean indicating that a restart occurred -- a signal that the route state needs to be re-applied.
---
## The Admin API Client
The CaddyClient in caddy.rs wraps Caddy's Admin API with typed Rust methods. The core operation is load_config -- sending a complete JSON configuration to Caddy:
```rust
pub struct CaddyClient {
    client: reqwest::Client,
    admin_url: String, // http://localhost:2019
}

impl CaddyClient {
    pub async fn load_config(&self, config: &CaddyConfig) -> Result<()> {
        let url = format!("{}/load", self.admin_url);
        let mut attempts = 0;
        let max_retries = 3;

        loop {
            match self.client.post(&url).json(config).send().await {
                Ok(resp) if resp.status().is_success() => return Ok(()),
                Ok(resp) if resp.status().is_client_error() => {
                    // Bad config -- fail immediately, no retry
                    return Err(ProxyError::CaddyConfigError(resp.text().await?));
                }
                Ok(_) | Err(_) if attempts < max_retries => {
                    attempts += 1;
                    let delay = Duration::from_millis(500 * 2u64.pow(attempts - 1));
                    tracing::warn!("Caddy load_config retry {}/{} in {:?}", attempts, max_retries, delay);
                    tokio::time::sleep(delay).await;
                }
                Err(e) => return Err(e.into()),
                Ok(resp) => return Err(ProxyError::CaddyServerError(resp.status())),
            }
        }
    }

    pub async fn ping(&self) -> bool {
        self.client
            .get(format!("{}/reverse_proxy/upstreams", self.admin_url))
            .timeout(Duration::from_secs(2))
            .send()
            .await
            .map(|r| r.status().is_success())
            .unwrap_or(false)
    }
}
```
The retry logic with exponential backoff (500ms, 1s, 2s) was added after we discovered that Caddy occasionally returns 5xx errors during route transitions, especially right after a restart. The critical distinction: 4xx errors (bad configuration) fail immediately -- retrying a malformed config is pointless and would only delay the error message.
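The backoff schedule itself is a one-liner. A stdlib-only sketch (the function name is ours, not from the codebase):

```rust
use std::time::Duration;

/// Delay before retry `attempt` (1-based): 500ms, 1s, 2s, ...
fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_millis(500 * 2u64.pow(attempt - 1))
}

fn main() {
    for attempt in 1..=3 {
        println!("retry {} after {:?}", attempt, backoff_delay(attempt));
    }
}
```

Doubling with a 500ms base caps total retry time at about 3.5 seconds, short enough that a deploy request does not hang noticeably while Caddy finishes a restart.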
---
## RwLock-Based Route State
The ProxyManager is the orchestrator. It maintains the canonical set of routes in memory, protected by an RwLock, and rebuilds the full Caddy configuration on every change:
```rust
pub struct ProxyManager {
    process: Mutex<CaddyProcess>,
    client: CaddyClient,
    routes: RwLock<HashMap<String, AppRoute>>,
    custom_certs: RwLock<Vec<CustomCert>>,
    email: RwLock<String>,
    config: ProxyConfig,
}

impl ProxyManager {
    pub async fn set_app_route(&self, app_id: &str, route: AppRoute) -> Result<()> {
        {
            let mut routes = self.routes.write().await;
            routes.insert(app_id.to_string(), route);
        } // write lock dropped before rebuilding
        self.rebuild_and_load().await
    }

    pub async fn remove_app_route(&self, app_id: &str) -> Result<()> {
        {
            let mut routes = self.routes.write().await;
            routes.remove(app_id);
        }
        self.rebuild_and_load().await
    }

    async fn rebuild_and_load(&self) -> Result<()> {
        let routes = self.routes.read().await;
        let certs = self.custom_certs.read().await;
        let email = self.email.read().await;
        let config = build_config_full(&routes, &certs, &email, &self.config);
        self.client.load_config(&config).await
    }
}
```
We chose to rebuild the entire Caddy configuration on every route change rather than using Caddy's granular PATCH endpoints. This is a conscious trade-off: full rebuilds are slightly less efficient but dramatically simpler to reason about. The complete route set is always the source of truth, and Caddy always receives a consistent, complete configuration. With dozens of routes (not thousands), the overhead is negligible.
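To make the trade-off concrete, here is a deliberately simplified, stdlib-only sketch of what a full rebuild looks like. The real build_config_full uses typed serde structs and handles certs and TLS settings; this version formats just the route array by hand, and the AppRoute fields are our invention:

```rust
use std::collections::HashMap;

/// Simplified stand-in for the real AppRoute (fields are illustrative).
struct AppRoute {
    host: String,     // e.g. "app.example.com"
    upstream: String, // e.g. "172.18.0.5:3000"
}

/// Rebuild the complete Caddy route array from the in-memory map.
/// Every call emits the full, consistent route set -- no diffing.
fn build_routes_json(routes: &HashMap<String, AppRoute>) -> String {
    let mut entries: Vec<String> = routes
        .values()
        .map(|r| {
            format!(
                r#"{{"match":[{{"host":["{0}"]}}],"handle":[{{"handler":"reverse_proxy","upstreams":[{{"dial":"{1}"}}]}}]}}"#,
                r.host, r.upstream
            )
        })
        .collect();
    entries.sort(); // deterministic output regardless of map iteration order
    format!("[{}]", entries.join(","))
}

fn main() {
    let mut routes = HashMap::new();
    routes.insert(
        "app-1".to_string(),
        AppRoute { host: "app.example.com".into(), upstream: "172.18.0.5:3000".into() },
    );
    println!("{}", build_routes_json(&routes));
}
```

Debugging this approach really is as simple as logging the string this function returns: whatever Caddy is serving, you have the exact JSON that produced it.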
The RwLock allows concurrent reads (for health checks, status queries) while ensuring exclusive access during writes. This matters because deploys can happen concurrently -- two users deploying different apps at the same time should not corrupt each other's route state.
---
## The Background Health Monitor
A background tokio task runs every five seconds, checking Caddy's health and auto-recovering from crashes:
```rust
// In main.rs -- spawned after Caddy starts
let proxy_monitor = proxy.clone();
tokio::spawn(async move {
    let mut interval = tokio::time::interval(Duration::from_secs(5));
    loop {
        interval.tick().await;
        match proxy_monitor.ensure_running().await {
            Ok(restarted) if restarted => {
                tracing::warn!("Caddy was restarted -- routes re-applied");
            }
            Err(e) => {
                tracing::error!("Caddy health check failed: {}", e);
            }
            _ => {} // healthy, nothing to do
        }
    }
});
```

When ensure_running() detects that Caddy died or became unresponsive, it kills the process, starts a fresh one, waits 500ms for initialization, then rebuilds and re-applies the full configuration from the in-memory route state. From the user's perspective, there is a brief blip and then everything works again.
---
## The Five-Fix Reliability Cascade
The initial Caddy integration worked in the happy path but crumbled under real-world conditions. Over the course of production hardening, we identified and fixed five distinct failure modes. Each fix addressed a specific scenario, and together they form a comprehensive reliability layer.
### Fix 1: Kill Stale Caddy on Startup
The first production crash taught us an embarrassing lesson: if sh0 crashes and restarts, the old Caddy process is still running, holding port 2019. The new Caddy instance cannot bind the Admin API port, so all proxy operations fail silently.
The fix: before spawning a new Caddy process, kill_stale_caddy() runs at the top of CaddyProcess::start(). It tries a graceful POST /stop on the Admin API with a 2-second timeout, falls back to pkill -f "caddy run" if the API is unresponsive, and sleeps 500ms to let the port release. If the connection is refused (no stale process), it skips entirely.
### Fix 2: Retry with Exponential Backoff
Caddy's Admin API occasionally returns transient errors during route updates, especially immediately after a restart. The retry logic described above (3 attempts, 500ms/1s/2s backoff) handles these gracefully. The key insight: only retry on connection errors and 5xx responses. A 4xx means the configuration is invalid, and retrying will never fix it.
### Fix 3: Sync Routes from Database on Startup
When sh0 restarts, the in-memory route map is empty. Without route syncing, every running app becomes unreachable even though the containers are still running. On startup, before spawning the health monitor, we now load all apps with status == "running" from the database, inspect their Docker containers for IPs on the sh0-net network, build AppRoute structs, and call proxy.sync_routes(). Existing services are reachable immediately.
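The filtering step can be sketched in a few lines. AppRecord below is a made-up stand-in for the real database row (sh0's actual schema lives in its DB layer):

```rust
/// Illustrative database row -- field names are ours, not sh0's schema.
struct AppRecord {
    status: String,       // "running", "stopped", ...
    host: String,         // public hostname for the app
    container_ip: String, // from inspecting the container on sh0-net
    port: u16,
}

/// On startup, only apps recorded as "running" get routes rebuilt;
/// returns (host, upstream-dial-address) pairs for the proxy layer.
fn routes_for_startup(apps: &[AppRecord]) -> Vec<(String, String)> {
    apps.iter()
        .filter(|a| a.status == "running")
        .map(|a| (a.host.clone(), format!("{}:{}", a.container_ip, a.port)))
        .collect()
}

fn main() {
    let apps = vec![
        AppRecord {
            status: "running".into(),
            host: "a.example.com".into(),
            container_ip: "172.18.0.2".into(),
            port: 3000,
        },
        AppRecord {
            status: "stopped".into(),
            host: "b.example.com".into(),
            container_ip: "172.18.0.3".into(),
            port: 3000,
        },
    ];
    for (host, dial) in routes_for_startup(&apps) {
        println!("{} -> {}", host, dial);
    }
}
```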
### Fix 4: Re-apply Routes After Crash Recovery
When the health monitor detects a Caddy crash and restarts it, the fresh Caddy instance has no routes. The ensure_running() method now returns a boolean indicating whether a restart occurred. If it did, the ProxyManager waits 500ms for Caddy to initialize, then rebuilds the full configuration from the in-memory route map and re-applies it via load_config. The five-second health loop becomes a self-healing mechanism.
### Fix 5: Soft Errors on Routing Failures
The final fix was philosophical as much as technical. When a deploy completes successfully -- the container is running and healthy -- but the Caddy route update fails, should we mark the deploy as failed? Initially, we did. But this was misleading: the app was running fine, just not routed. The fix: route failures are logged as errors but the deploy is marked as succeeded. The health monitor will eventually re-apply routes when Caddy recovers.
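The policy fits in a few lines. A sketch with our own names (DeployOutcome is not the real type):

```rust
#[derive(Debug, PartialEq)]
enum DeployOutcome {
    Succeeded,
    Failed,
}

/// Soft-error policy: a healthy container means the deploy succeeded,
/// even if wiring up the route failed. The health monitor re-applies
/// routes once Caddy recovers, so the failure is transient.
fn deploy_outcome(container_healthy: bool, route_applied: bool) -> DeployOutcome {
    if !container_healthy {
        return DeployOutcome::Failed;
    }
    if !route_applied {
        // Logged as an error, but not fatal to the deploy.
        eprintln!("route update failed; will be retried by the health monitor");
    }
    DeployOutcome::Succeeded
}

fn main() {
    // Container is up but Caddy rejected the route update:
    // the deploy still counts as a success.
    println!("{:?}", deploy_outcome(true, false));
}
```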
---
## Lessons Learned
Building a programmatic proxy layer taught us three things:
**Full config rebuilds beat incremental updates for small-scale systems.** With fewer than a hundred routes, the simplicity of "rebuild everything and POST /load" far outweighs the complexity of tracking individual route patches. The entire config is always consistent, and debugging is trivial -- just log the JSON you sent to Caddy.
**Child process management is harder than it looks.** Stale processes, pipe buffer deadlocks (a story for the next article), port conflicts, and crash recovery all need explicit handling. A "spawn and forget" approach works until it does not, and when it fails, it fails catastrophically.
**Self-healing beats alerting for infrastructure components.** The health monitor that auto-restarts Caddy and re-applies routes is more valuable than any monitoring dashboard. Users do not care that Caddy crashed for three seconds if their app is back online in eight.
---
## What Comes Next
The proxy layer was solid -- or so we thought. In the next article, we will walk through the eight-step deploy pipeline that ties together git cloning, Docker builds, health checks, and blue-green container swaps into a single atomic operation. And after that, we will tell you about the 16KB bug that nearly drove us to abandon Caddy entirely.
This is Part 5 of the "How We Built sh0.dev" series. sh0 is a PaaS platform built entirely by a CEO in Abidjan and an AI CTO, with zero human engineers.