
Blue-Green Deploys: Building a Zero-Downtime Pipeline in Rust

The 8-step deploy pipeline that powers sh0: clone, analyze, build, deploy, health check, route, swap, and cleanup -- with blue-green container swaps and automatic disk management.

Thales & Claude | March 25, 2026 | 11 min read
Tags: deployment, blue-green, rust, docker, zero-downtime, devops, pipeline

A deployment pipeline is a promise. It promises that when a developer pushes code, something predictable will happen: their app will be analyzed, built, tested, routed, and made live -- or the failure will be clear and recoverable. Break that promise even once and trust evaporates.

We built sh0's deploy pipeline as a single async Rust function -- approximately 350 lines that orchestrate eight discrete steps from git clone to live traffic. The pipeline supports four source types (Git repository, Dockerfile, Docker image, file upload), performs blue-green container swaps with zero downtime, and includes a disk management system born from watching too many servers run out of space.

This is the story of those 350 lines.

---

The Eight Steps

Every deployment in sh0 flows through the same eight-step sequence. The sequence is fixed regardless of source type; only the early steps differ in how they are implemented.

Clone --> Analyze --> Build --> Deploy --> Health Check --> Route --> Swap --> Finalize

Each step updates the deployment record in the database with its status and appends structured log lines that the dashboard parses into a progress bar. The append_step_log() helper writes markers like [STEP 3/6] Building Docker image... that the frontend's parseDeploySteps() function picks up automatically.
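These markers are simple enough to parse mechanically. As a sketch (the function name and tuple shape here are illustrative, not sh0's actual frontend code, which is JavaScript), a marker parser in Rust might look like:

```rust
// Hypothetical sketch: parse a "[STEP 3/6] Building Docker image..." marker
// into (current_step, total_steps, message). Returns None for plain log lines.
fn parse_step_marker(line: &str) -> Option<(u32, u32, &str)> {
    let rest = line.strip_prefix("[STEP ")?;
    let (nums, msg) = rest.split_once("] ")?;
    let (cur, total) = nums.split_once('/')?;
    Some((cur.parse().ok()?, total.parse().ok()?, msg))
}

fn main() {
    let parsed = parse_step_marker("[STEP 3/6] Building Docker image...");
    assert_eq!(parsed, Some((3, 6, "Building Docker image...")));
    assert_eq!(parse_step_marker("plain log line"), None);
}
```

Because the denominator is embedded in every marker, the progress bar needs no out-of-band knowledge of how many steps a given deploy variant has.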

Let us walk through each step.

Step 1: Clone

For Git deployments, this calls GitRepo::clone_or_pull() into repos_dir/{app_id}/. If the repository already exists locally, it pulls the latest changes rather than cloning from scratch. For Docker image deploys, this step is a no-op. For file uploads, the archive has already been extracted.

Step 2: Analyze

The sh0_builder::check() function produces a health report for the source code. It detects the stack (Node.js, Python, Go, Rust, static site, or raw Dockerfile), checks for a valid Dockerfile or generates one, validates the project structure, and assigns a confidence score out of 100.

If the score falls below 60, the deploy fails immediately. This is a deliberate gate: building an image that will certainly fail wastes minutes of compute and disk space. Better to fail fast with a clear message like "No package.json found and no Dockerfile present" than to let the Docker build stumble through errors.
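The gate itself is a one-branch check. As a minimal sketch (the `CheckReport` struct and field names are assumptions for illustration; only the threshold of 60 comes from the pipeline):

```rust
// Illustrative shape of the analysis report. Field names are assumptions,
// not sh0_builder's actual API.
struct CheckReport {
    score: u8,           // confidence out of 100
    issues: Vec<String>, // e.g. "No package.json found and no Dockerfile present"
}

const MIN_CONFIDENCE: u8 = 60; // the gate described above

fn gate(report: &CheckReport) -> Result<(), String> {
    if report.score < MIN_CONFIDENCE {
        return Err(format!(
            "Analysis score {}/100 below threshold: {}",
            report.score,
            report.issues.join("; ")
        ));
    }
    Ok(())
}

fn main() {
    let bad = CheckReport { score: 20, issues: vec!["No package.json found".into()] };
    assert!(gate(&bad).is_err());
    let good = CheckReport { score: 85, issues: vec![] };
    assert!(gate(&good).is_ok());
}
```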

Step 3: Build

This is where Docker takes over. The builder constructs an image tagged sh0-{app_name}:{commit_short} using the detected or provided Dockerfile. Build logs stream to the deployment record in real time.

The image tag format matters for the cleanup system we will discuss later -- the app name prefix lets us identify and prune old images on a per-app basis.
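The convention is mechanical enough to capture in a helper. A sketch (the function name is illustrative; the `sh0-{app_name}:{commit_short}` format itself is the pipeline's):

```rust
// Build the image tag sh0-{app_name}:{commit_short}, truncating the commit
// SHA to the conventional 7 characters. Helper name is hypothetical.
fn image_tag(app_name: &str, commit: &str) -> String {
    let short = &commit[..commit.len().min(7)];
    format!("sh0-{}:{}", app_name, short)
}

fn main() {
    assert_eq!(image_tag("my-api", "a1b2c3d4e5f6"), "sh0-my-api:a1b2c3d");
}
```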

Step 4: Deploy

A new container is created from the built image and started on the sh0-net Docker network. The container configuration includes environment variables (fetched from the database), resource limits, and health check definitions.

Here is the critical detail: the new container runs alongside the old one. Both are live on the Docker network simultaneously. This is the "blue-green" in blue-green deploys -- the new container (green) starts while the old container (blue) continues serving traffic.

Step 5: Health Check

The pipeline waits for the new container to become healthy before routing traffic to it. The health check strategy depends on the container's configuration:

async fn wait_for_healthy(
    docker: &DockerClient,
    container_id: &str,
    timeout: Duration,
) -> Result<()> {
    let deadline = Instant::now() + timeout;

    loop {
        if Instant::now() > deadline {
            return Err(DeployError::HealthCheckTimeout);
        }

        let state = docker.inspect_container(container_id).await?;

        match state.health {
            Some(health) if health.status == "healthy" => return Ok(()),
            Some(health) if health.status == "unhealthy" => {
                return Err(DeployError::HealthCheckFailed);
            }
            Some(_) => {
                // Starting -- keep waiting
                tokio::time::sleep(Duration::from_secs(2)).await;
            }
            None => {
                // No HEALTHCHECK defined -- verify running for 5s
                if state.running && state.started_at.elapsed() > Duration::from_secs(5) {
                    return Ok(());
                }
                tokio::time::sleep(Duration::from_secs(1)).await;
            }
        }
    }
}

If the container has a Docker HEALTHCHECK instruction, we poll until it reports healthy (or timeout after 60 seconds). If there is no health check, we use a simpler heuristic: verify the container has been running for at least five seconds without crashing. This handles the common case of a misconfigured app that crashes immediately on startup.

Step 6: Route

Once the health check passes, we update the Caddy reverse proxy to point at the new container. The pipeline inspects the new container for its IP address on the sh0-net network and calls proxy.set_app_route() with the updated upstream:

let container_info = docker.inspect_container(&new_container_id).await?;
let ip = container_info
    .network_settings
    .networks
    .get("sh0-net")
    .ok_or(DeployError::NoNetworkIp)?
    .ip_address
    .clone();

let route = AppRoute {
    domains: app_domains.clone(),
    upstream: format!("{}:{}", ip, app.port),
};

// Soft error -- deploy succeeded even if routing fails temporarily
match proxy.set_app_route(&app.id, route).await {
    Ok(()) => append_step_log(&pool, deploy_id, "Routes updated"),
    Err(e) => {
        tracing::error!("Route update failed (non-fatal): {}", e);
        append_step_log(&pool, deploy_id, "Route update failed -- will retry");
    }
}

Notice the soft error handling. A routing failure does not fail the deploy. The container is running and healthy; the health monitor will eventually re-apply the route. This was Fix 5 from the reliability cascade described in the previous article.

Step 7: Swap

Now we decommission the old container. This is where zero-downtime deploys become real:

if let Some(old_container_id) = previous_container_id {
    // Give the old container a 30-second grace period
    docker.stop_container(&old_container_id, Duration::from_secs(30)).await?;
    docker.remove_container(&old_container_id).await?;
}

The 30-second grace period is a SIGTERM followed by a wait. This gives the old container time to finish processing in-flight requests, close database connections, and shut down gracefully. After 30 seconds, Docker sends SIGKILL. Because Caddy already routes new traffic to the green container (Step 6), the only requests hitting the blue container are ones that were in-flight when the route switched.

Step 8: Finalize

The pipeline updates the app record with the new container ID and image ID, marks the deployment as succeeded, and cleans up the build directory. The entire sequence -- from git clone to live traffic -- typically completes in 30 to 90 seconds depending on the Docker build complexity.

---

The Deploy Status State Machine

A deployment in sh0 moves through a defined set of states:

pending --> cloning --> analyzing --> building --> deploying
    --> health_checking --> routing --> swapping --> succeeded
                                                  \
                                                   --> failed

Any step can transition to failed. The state is stored in the database and exposed through the API, so the dashboard can show real-time progress. Each transition also appends a structured log line, giving users a detailed audit trail of what happened and when.
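The transition rules can be expressed as a small pure function. This is a sketch of the state machine above, not sh0's actual implementation: each state advances to the next in sequence, and any non-terminal state may drop to failed.

```rust
// Variant names mirror the diagram above; the legality check is illustrative.
#[derive(Clone, Copy, PartialEq, Debug)]
enum DeployState {
    Pending, Cloning, Analyzing, Building, Deploying,
    HealthChecking, Routing, Swapping, Succeeded, Failed,
}

fn can_transition(from: DeployState, to: DeployState) -> bool {
    use DeployState::*;
    let order = [Pending, Cloning, Analyzing, Building, Deploying,
                 HealthChecking, Routing, Swapping, Succeeded];
    let pos = |s| order.iter().position(|&x| x == s);
    match (pos(from), to) {
        // Any in-flight state may fail; Succeeded is terminal.
        (Some(_), Failed) => from != Succeeded,
        // Otherwise only the immediate next state is legal -- no skipping.
        (Some(i), _) => pos(to) == Some(i + 1),
        // Failed is terminal.
        (None, _) => false,
    }
}

fn main() {
    use DeployState::*;
    assert!(can_transition(Building, Deploying));
    assert!(can_transition(Routing, Failed));
    assert!(!can_transition(Succeeded, Failed));
    assert!(!can_transition(Pending, Building)); // no skipping steps
    assert!(!can_transition(Failed, Pending));
}
```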

The state machine is not just for UI -- it drives the rollback mechanism too. When a user triggers a rollback, the system looks up the target deployment's image_id and creates a new deployment that reuses that image. The pipeline detects the pre-existing image and skips the clone, analyze, and build steps, jumping straight to deploy. This means rollbacks complete in seconds rather than minutes.

---

The Disk Cleanup System

We call this the "anti-Coolify" feature. We watched Coolify (an open-source PaaS) fill server disks with orphaned Docker images and build cache, eventually crashing the entire platform. We refused to ship the same bug.

The cleanup system has three layers: proactive checking, periodic cleanup, and smart retention.

Pre-Deploy Disk Check

Before every deploy, the pipeline checks available disk space using the statvfs system call:

fn check_disk_space(path: &Path) -> Result<DiskStatus> {
    let stat = nix::sys::statvfs::statvfs(path)?;
    let total = stat.blocks() * stat.fragment_size();
    let available = stat.blocks_available() * stat.fragment_size();
    let usage_pct = ((total - available) as f64 / total as f64) * 100.0;

    if usage_pct > 90.0 {
        Err(DeployError::DiskFull {
            usage: format!("{:.1}%", usage_pct),
        })
    } else {
        if usage_pct > 80.0 {
            tracing::warn!("Disk usage at {:.1}% -- approaching capacity", usage_pct);
        }
        Ok(DiskStatus { usage_pct, available_gb: available as f64 / 1e9 })
    }
}

Above 90% usage, the deploy fails immediately with a clear error message. Between 80% and 90%, it proceeds but logs a warning. This fail-fast behavior prevents the catastrophic scenario where a Docker build fills the last gigabyte of disk space and takes down the entire server.
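The threshold policy reduces to a pure function, which makes it easy to test separately from the statvfs call. A sketch (the enum and its variant names are illustrative; the 80%/90% thresholds come from the check above):

```rust
// Pure decision function over the usage percentage. Variant names are
// hypothetical; only the thresholds mirror the pipeline's behavior.
#[derive(PartialEq, Debug)]
enum DiskAction {
    Proceed,
    ProceedWithWarning, // 80-90%: deploy continues, warning logged
    RefuseDeploy,       // above 90%: fail fast before building
}

fn disk_action(usage_pct: f64) -> DiskAction {
    if usage_pct > 90.0 {
        DiskAction::RefuseDeploy
    } else if usage_pct > 80.0 {
        DiskAction::ProceedWithWarning
    } else {
        DiskAction::Proceed
    }
}

fn main() {
    assert_eq!(disk_action(45.0), DiskAction::Proceed);
    assert_eq!(disk_action(85.5), DiskAction::ProceedWithWarning);
    assert_eq!(disk_action(93.2), DiskAction::RefuseDeploy);
}
```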

Periodic Background Cleanup

A background tokio task runs every six hours (configurable via --cleanup-interval-hours, or disable with 0). It performs three operations:

1. Prune stopped containers -- removes any stopped container with the sh0-managed label
2. Prune dangling images -- removes untagged, unused images left over from multi-stage builds
3. Prune build cache -- clears Docker's builder cache

Smart Per-App Image Retention

The most sophisticated piece is prune_old_app_images(keep_per_app). It lists all images with the sh0- prefix, groups them by app name, sorts by creation date, and removes everything except the N most recent images per app (default 3, configurable via --cleanup-keep-images):

pub async fn prune_old_app_images(
    docker: &DockerClient,
    keep_per_app: usize,
) -> Result<PruneReport> {
    let images = docker.list_images_with_prefix("sh0-").await?;

    // Group by app name (sh0-{app_name}:{tag})
    let mut by_app: HashMap<String, Vec<_>> = HashMap::new();
    for img in images {
        if let Some(app_name) = img.tag.split(':').next() {
            by_app.entry(app_name.to_string()).or_default().push(img);
        }
    }

    let mut removed = 0;
    for (_app, mut imgs) in by_app {
        imgs.sort_by(|a, b| b.created.cmp(&a.created));
        for old in imgs.into_iter().skip(keep_per_app) {
            docker.remove_image(&old.id).await?;
            removed += 1;
        }
    }

    Ok(PruneReport { images_removed: removed })
}

This means each app keeps its three most recent images (for rollback) while everything older is automatically cleaned up. The naming convention sh0-{app_name}:{commit_short} makes the grouping trivial.
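The retention logic itself needs no Docker at all. A sketch reduced to a pure function over (tag, creation timestamp) pairs, mirroring the grouping and skip logic (the function name and signature are illustrative, not the actual prune_old_app_images API):

```rust
use std::collections::HashMap;

// Group tags by the part before ':', keep the `keep` newest per group,
// and return the tags that should be deleted. Illustrative sketch.
fn tags_to_prune(images: &[(&str, u64)], keep: usize) -> Vec<String> {
    let mut by_app: HashMap<&str, Vec<(&str, u64)>> = HashMap::new();
    for &(tag, created) in images {
        let app = tag.split(':').next().unwrap_or(tag);
        by_app.entry(app).or_default().push((tag, created));
    }
    let mut doomed = Vec::new();
    for (_, mut imgs) in by_app {
        imgs.sort_by(|a, b| b.1.cmp(&a.1)); // newest first
        doomed.extend(imgs.into_iter().skip(keep).map(|(t, _)| t.to_string()));
    }
    doomed
}

fn main() {
    let images = [
        ("sh0-api:aaa", 1), ("sh0-api:bbb", 2), ("sh0-api:ccc", 3),
        ("sh0-api:ddd", 4), ("sh0-web:eee", 5),
    ];
    // sh0-api keeps its 3 newest images; sh0-web has only one, so it is safe.
    let pruned = tags_to_prune(&images, 3);
    assert_eq!(pruned, vec!["sh0-api:aaa"]);
}
```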

---

The Rollback Mechanism

Rollbacks in sh0 are deployments in disguise. When a user hits the rollback button, the API creates a new deployment record pointing at the target deployment's image_id:

POST /api/v1/deployments/:id/rollback

This spawns the standard pipeline, which detects that the Docker image already exists and skips the clone/analyze/build steps. The deploy proceeds from Step 4: start a new container from the existing image, run health checks, update routes, swap out the current container. A rollback typically completes in under ten seconds.

The beauty of this approach is that rollbacks use exactly the same code path as fresh deploys. There is no separate "rollback logic" to maintain and no edge cases where rollbacks behave differently from deploys. The pipeline is the pipeline.

---

Pipeline Variants

The eight-step pipeline adapts to four deployment source types:

Source Type      Steps    Notes
Git Repository   6 steps  Clone, analyze, build, deploy, health check, route
Dockerfile       5 steps  Build, deploy, health check, route, complete
Docker Image     4 steps  Pull, deploy, health check, route
File Upload      5 steps  Analyze, build, deploy, health check, route

Each variant writes its own step markers ([STEP 1/6], [STEP 1/4], etc.) so the dashboard progress bar scales correctly. The core logic -- deploy, health check, route, swap -- is shared across all variants.
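Deriving the marker denominator from the source type keeps the variants honest. A sketch (the enum, function names, and the idea of centralizing the count are illustrative; the per-variant step counts come from the table above):

```rust
// Illustrative: one place owns the "[STEP i/N]" denominator per variant.
#[derive(Clone, Copy)]
enum SourceType { GitRepo, Dockerfile, DockerImage, FileUpload }

fn total_steps(source: SourceType) -> u32 {
    match source {
        SourceType::GitRepo => 6,     // clone, analyze, build, deploy, health check, route
        SourceType::Dockerfile => 5,  // build, deploy, health check, route, complete
        SourceType::DockerImage => 4, // pull, deploy, health check, route
        SourceType::FileUpload => 5,  // analyze, build, deploy, health check, route
    }
}

fn step_marker(source: SourceType, step: u32, msg: &str) -> String {
    format!("[STEP {}/{}] {}", step, total_steps(source), msg)
}

fn main() {
    assert_eq!(step_marker(SourceType::GitRepo, 3, "Building Docker image..."),
               "[STEP 3/6] Building Docker image...");
    assert_eq!(total_steps(SourceType::DockerImage), 4);
}
```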

---

Concurrency and Error Handling

Deploys run as spawned tokio tasks, not inline with the API request. When a user triggers a deploy (or a webhook fires), the API endpoint creates the deployment record, sets its status to pending, and spawns the pipeline:

tokio::spawn(async move {
    if let Err(e) = run_pipeline(state, app_id, deploy_id).await {
        tracing::error!("Deploy {} failed: {}", deploy_id, e);
        // Update deployment status to failed with error message
        update_deploy_status(&pool, deploy_id, "failed", &e.to_string()).await;
    }
});

The API returns immediately with the deployment ID. The dashboard polls for status updates, rendering the progress bar from the step markers in the build log.

This architecture means multiple deploys can run concurrently for different apps. The RwLock on the proxy route state ensures they do not corrupt each other's routing, and the Docker client handles concurrent container operations natively.
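The shared-route-table pattern can be sketched with std primitives, using OS threads in place of tokio tasks so the example is self-contained (the `set_app_route` helper and `RouteTable` alias here are illustrative; sh0's real proxy state is behind an async lock):

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

type RouteTable = Arc<RwLock<HashMap<String, String>>>;

// A deploy pipeline's route update: hold the write lock only long enough
// to swap one app's upstream. Readers (the proxy hot path) take read locks.
fn set_app_route(routes: &RouteTable, app: &str, upstream: &str) {
    routes.write().unwrap().insert(app.to_string(), upstream.to_string());
}

fn main() {
    let routes: RouteTable = Arc::new(RwLock::new(HashMap::new()));

    // Two concurrent "deploys" for different apps do not corrupt the table.
    let handles: Vec<_> = [("app-a", "10.0.0.2:3000"), ("app-b", "10.0.0.3:8080")]
        .into_iter()
        .map(|(app, upstream)| {
            let routes = Arc::clone(&routes);
            thread::spawn(move || set_app_route(&routes, app, upstream))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    let table = routes.read().unwrap();
    assert_eq!(table.get("app-a").map(String::as_str), Some("10.0.0.2:3000"));
    assert_eq!(table.len(), 2);
}
```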

---

Lessons Learned

The pipeline should be the only code path. Rollbacks, webhook deploys, manual deploys, and API-triggered deploys all flow through the same function. This eliminates an entire class of bugs where "rollback works differently" or "webhooks skip the health check."

Fail fast on disk space. The statvfs check adds negligible latency but prevents the single most common PaaS failure mode: a full disk. Every self-hosted platform eventually runs out of space. The question is whether it fails gracefully or catastrophically.

Keep N images, not N days. Time-based retention (e.g., "delete images older than 7 days") is unpredictable -- a rarely deployed app might have its only working image deleted. Count-based retention ("keep the 3 most recent per app") guarantees rollback capability regardless of deploy frequency.

---

What Comes Next

The deploy pipeline was humming. The proxy was routing traffic. Everything worked -- for about five minutes at a time. Then Caddy would freeze, the health monitor would kill it, routes would re-apply, and five minutes later it would freeze again. The next article tells the story of the 16KB bug: a classic Unix pipe deadlock hiding in our modern Rust codebase.

This is Part 6 of the "How We Built sh0.dev" series. sh0 is a PaaS platform built entirely by a CEO in Abidjan and an AI CTO, with zero human engineers.
