
The Backup Engine That Never Backed Up

We built a complete backup engine with 13 storage providers and AES-256 encryption. Then we clicked "Backup Now" and nothing happened. Here is everything that was broken.

Thales & Claude | March 27, 2026 | 9 min read

Tags: backup, rust, tokio, docker, postgres, debugging, architecture, devops

We had a beautiful backup engine. AES-256-GCM encryption. 13 storage providers. Chunked compression. A restore pipeline. A scheduler with cron expressions. The dashboard had a backup page with modals, a CronBuilder component, storage provider configuration, and three-step wizards for both triggering and scheduling backups.

Then a user clicked "Backup Now." The button spun. A toast said "Backup triggered." The backup appeared in the history list with a yellow "pending" badge.

And it stayed pending. Forever.

The backup engine existed. The scheduling infrastructure existed. The UI existed. But no one had ever wired the trigger to the engine. The trigger_backup API handler created a database record with status: "pending" and returned 202 Accepted. It never called BackupEngine::create_backup(). The record sat in SQLite, pending, until someone noticed.

This is the story of everything that was broken, and how we fixed it in one session.

Bug 1: The Handler That Did Nothing

Here is what the handler looked like before the fix:

```rust
pub async fn trigger_backup(
    State(state): State<AppState>,
    auth: AuthUser,
    Json(body): Json<TriggerBackupRequest>,
) -> Result<(StatusCode, Json<serde_json::Value>)> {
    let backup = Backup {
        id: uuid::Uuid::new_v4().to_string(),
        source_type: body.source_type,
        source_id: body.source_id,
        status: "pending".to_string(),
        // ...
    };

    backup.insert(&state.pool)?;

    // Nothing else happens. The engine is never called.
    Ok((StatusCode::ACCEPTED, Json(serde_json::json!({ "id": backup.id }))))
}
```

That is the entire function. It validates the request, inserts a row into the backups table, and returns. The BackupEngine -- the component that actually runs pg_dump, compresses, encrypts, and stores the result -- is never invoked.

The fix was to add execute_existing_backup() to BackupEngine: a method that takes an already-inserted backup record and runs the full pipeline against it, updating the status from "pending" to "running" to "completed" (or "failed"). The handler now spawns this as a background task:

```rust
let engine = state.backup_engine.clone();
let pool = state.pool.clone();

tokio::spawn(async move {
    if let Err(e) = engine.execute_existing_backup(&backup_id, config).await {
        tracing::error!(backup_id = %backup_id, error = %e, "Triggered backup failed");
    }
});
```

The API still returns 202 Accepted immediately -- the backup runs asynchronously. The client can poll the backup status to see progress.
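The lifecycle the engine now drives can be sketched as a small state machine. This is illustrative only, not the actual sh0 types: the real status is a string column in SQLite.

```rust
// Sketch of the backup status lifecycle described above
// (names are illustrative, not the actual sh0 types).
#[derive(Debug, Clone, Copy, PartialEq)]
enum BackupStatus {
    Pending,   // row inserted by the handler, 202 returned
    Running,   // execute_existing_backup picked it up
    Completed, // pipeline finished and the artifact is stored
    Failed,    // some pipeline stage errored
}

impl BackupStatus {
    /// Which transitions the engine is allowed to make.
    fn can_transition_to(self, next: BackupStatus) -> bool {
        use BackupStatus::*;
        matches!(
            (self, next),
            (Pending, Running) | (Running, Completed) | (Running, Failed)
        )
    }
}

fn main() {
    // A pending backup must pass through Running before completing.
    assert!(BackupStatus::Pending.can_transition_to(BackupStatus::Running));
    assert!(!BackupStatus::Pending.can_transition_to(BackupStatus::Completed));
    println!("ok");
}
```

The original bug, in these terms: the handler created Pending and no component ever performed the Pending → Running transition.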

Bug 2: The Scheduler That Was Never Spawned

The BackupScheduler struct existed. It had a tick() method that queried for enabled schedules, checked if their next_run had passed, and triggered backups. The scheduler had unit tests. The processing guard prevented concurrent runs. The cron normalization handled 5-field, 6-field, and 7-field expressions.
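The normalization those tests covered can be sketched roughly like this. The exact rules in sh0's BackupScheduler may differ; this is a minimal version, assuming a parser (such as the cron crate) that wants a seconds field first and an optional year field last.

```rust
// Hedged sketch of 5/6/7-field cron normalization. Classic cron
// expressions have 5 fields (minute hour dom month dow); parsers
// like the `cron` crate expect 6 (seconds first) or 7 (plus year).
fn normalize_cron(expr: &str) -> String {
    let fields: Vec<&str> = expr.split_whitespace().collect();
    match fields.len() {
        // 5-field: prepend a "0" seconds field
        5 => format!("0 {}", fields.join(" ")),
        // 6- or 7-field: already parser-shaped, just re-join
        6 | 7 => fields.join(" "),
        // anything else: pass through and let the parser report it
        _ => expr.to_string(),
    }
}

fn main() {
    assert_eq!(normalize_cron("*/5 * * * *"), "0 */5 * * * *");
    assert_eq!(normalize_cron("0 0 3 * * *"), "0 0 3 * * *");
    println!("ok");
}
```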

But BackupScheduler was never instantiated. In main.rs, the CronScheduler (for cron jobs like webhooks and deployments) was spawned as a background task. The BackupScheduler was not. No one had added the spawn block.

```rust
// This existed for cron jobs:
let cron_handle = tokio::spawn(async move {
    let mut interval = tokio::time::interval(Duration::from_secs(60));
    loop {
        interval.tick().await;
        scheduler.tick(&pool, &docker).await;
    }
});

// This did NOT exist for backups.
```

The fix was adding an identical block after the CronScheduler spawn, creating a BackupScheduler with its own BackupEngine instance and ticking every 60 seconds.

Bug 3: Volume Backups Hit the Host Filesystem

When a user triggered a volume backup, the engine tried to create a tar archive of the volume path. The backup_volume() function called tar::Builder::append_dir_all(".", path) on the host filesystem.

But Docker volumes are not accessible as host paths. The path /var/lib/postgresql/data exists inside the container, not on the macOS or Linux host. The function would fail with "Volume path does not exist."

The fix was to use Docker's archive API instead:

```rust
// Before: host filesystem tar (broken)
backup_volume(Path::new(&path))

// After: Docker archive API (correct)
docker.copy_from_container(container_id, path).await
```
Docker's GET /containers/{id}/archive endpoint returns a tar archive of any path inside the container. It is binary-safe (unlike exec_in_container which parses stdout as UTF-8 text and corrupts binary data) and purpose-built for this use case.

We added copy_from_container() to the Docker client and a matching copy_to_container() for restore. The volume backup now works for any container, regardless of what filesystem the volume uses internally.
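For illustration, the request path such a call hits can be built like this. This is a minimal sketch with a hand-rolled percent-encoder; a real client should use a proper URL-encoding library.

```rust
// Builds the Docker Engine API path for the archive endpoint
// described above. The percent-encoding here is deliberately
// minimal and only for illustration.
fn archive_request_path(container_id: &str, path: &str) -> String {
    let encoded: String = path
        .bytes()
        .map(|b| match b {
            // unreserved characters and '/' pass through
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9'
            | b'-' | b'_' | b'.' | b'~' | b'/' => (b as char).to_string(),
            // everything else is percent-escaped
            _ => format!("%{:02X}", b),
        })
        .collect();
    format!("/containers/{}/archive?path={}", container_id, encoded)
}

fn main() {
    assert_eq!(
        archive_request_path("abc123", "/var/lib/postgresql/data"),
        "/containers/abc123/archive?path=/var/lib/postgresql/data"
    );
    println!("ok");
}
```

The response body is a tar stream, which the backup pipeline can feed directly into compression and encryption without ever materializing the files on the host.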

Bug 4: pg_dump Tried to Dump "flin-postgres"

This was the subtlest bug. When backing up a database-type app (like a PostgreSQL instance deployed via template), the handler used the app name as the database name:

```rust
BackupSource::Database {
    db_name: app.name,  // "flin-postgres" -- the app name, not the database!
    // ...
}
```

The actual database inside the container is named whatever POSTGRES_DB was set to in the template -- typically "app", "0cron", or "postgres". The app name "flin-postgres" is just a human-readable label in the dashboard.

The result: pg_dump -U postgres flin-postgres fails with FATAL: database "flin-postgres" does not exist.

The fix was to read the app's encrypted environment variables from the database, decrypt them with the master key, and extract the real database credentials:

```rust
let env_vars = EnvVar::list_by_app_id(&pool, &app.id)?;
let env_map = decrypt_env_map(&master_key, &env_vars);

let (db_name, db_user, db_password) = match engine.as_str() {
    "postgres" | "postgresql" => {
        let name = env_map.get("POSTGRES_DB")
            .cloned()
            .unwrap_or_else(|| "postgres".into());
        let user = env_map.get("POSTGRES_USER")
            .cloned()
            .unwrap_or_else(|| "postgres".into());
        let pass = env_map.get("POSTGRES_PASSWORD").cloned();
        (name, user, pass)
    }
    "mysql" | "mariadb" => {
        let name = env_map.get("MYSQL_DATABASE")
            .cloned()
            .unwrap_or_else(|| "app".into());
        // ...
    }
    // mongodb, redis, etc.
};
```

The credentials handler for the dashboard already solved this problem -- the get_db_credentials endpoint reads POSTGRES_DB from env vars to show the "Database Credentials" card. We just needed to reuse the same pattern in the backup handler.

Bug 5: Volume Path Defaulted to "/data"

When the scheduler built a BackupConfig for a volume backup, it looked up the app's mounts to find the volume path. If no mounts were found in the database (which happens when the template deployment flow does not persist mount records), it fell back to /data.

PostgreSQL stores data in /var/lib/postgresql/data. MySQL uses /var/lib/mysql. MongoDB uses /data/db. Redis uses /data. The hardcoded /data fallback only worked for Redis.

The fix was a stack-aware default:

```rust
fn default_volume_path(stack: &str) -> &'static str {
    match stack {
        "postgres" | "postgresql" | "timescaledb" | "pgvector"
            => "/var/lib/postgresql/data",
        "mysql" | "mariadb" | "tidb" => "/var/lib/mysql",
        "mongodb" | "mongo" => "/data/db",
        "redis" | "dragonfly" | "keydb" | "valkey" => "/data",
        "clickhouse" => "/var/lib/clickhouse",
        "cassandra" | "scylladb" => "/var/lib/cassandra",
        "influxdb" => "/var/lib/influxdb2",
        _ => "/data",
    }
}
```

This maps each database engine to its standard data directory. When both the schedule.path and mount records are unavailable, the system knows where PostgreSQL, MySQL, and MongoDB store their data by convention.

Bug 6: "0 Databases Available"

The backup page showed "0 databases available" even when the user had a running PostgreSQL instance. The reason: database apps deployed via templates are stored in the apps table, not the databases table. The backup page only queried databasesApi.list(), which returned formal Database records. Template-deployed databases were invisible.

The fix was to detect database-type apps by their stack field and show them alongside formal databases:

```typescript
const DUMPABLE_STACKS = new Set([
    'postgres', 'mysql', 'mariadb', 'mongodb', 'redis',
    'timescaledb', 'pgvector'
]);

const dbApps = $derived(
    apps.filter((a) => a.stack && DUMPABLE_STACKS.has(a.stack.toLowerCase()))
);

const totalDbSources = $derived(databases.length + dbApps.length);
```
The backend also needed a fallback: when source_type is "database" but Database::find_by_id fails, try App::find_by_id and derive the engine from the app's stack field.
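That fallback can be sketched as follows. Database, App, and the lookup functions here are simplified stand-ins for the real sh0 models, reduced to plain Options for illustration.

```rust
// Hedged sketch of the database-source fallback described above.
struct Database {
    engine: String,
}

struct App {
    stack: Option<String>,
}

// Stand-ins for Database::find_by_id / App::find_by_id.
fn find_database(_id: &str) -> Option<Database> {
    None // simulate: no formal Database record exists
}

fn find_app(_id: &str) -> Option<App> {
    Some(App { stack: Some("postgres".into()) })
}

/// Resolve the engine for a "database" backup source: prefer a
/// formal Database record, then fall back to a database-type
/// app's stack field.
fn resolve_engine(source_id: &str) -> Option<String> {
    if let Some(db) = find_database(source_id) {
        return Some(db.engine);
    }
    find_app(source_id)?.stack
}

fn main() {
    // Template-deployed database: found via the App fallback.
    assert_eq!(resolve_engine("app-1").as_deref(), Some("postgres"));
    println!("ok");
}
```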

What the Session Produced

In one session, we shipped:

| Fix | Impact |
| --- | --- |
| Trigger handler calls engine | Backups actually execute |
| BackupScheduler spawned | Scheduled backups fire |
| Docker archive API for volumes | Volume backups work in containers |
| Env var lookup for db_name | pg_dump uses real database name |
| Stack-aware default paths | Volume paths correct per engine |
| Database-type app detection | "0 databases" shows actual count |
| Download endpoint | Users can download completed backups |
| Delete confirmation modal | No more accidental deletions |
| Schedule cards UI | Schedules readable at a glance |
| Edit schedule modal | Modify cron/retention/destination |
| Run Now confirmation | Prevent accidental triggers |

Eleven fixes. The backup engine went from "looks complete in code review" to "actually works end-to-end."

The Lesson

A backup engine that was never triggered is worse than no backup engine at all. It creates false confidence. The dashboard showed a "Backup Now" button. The schedule form accepted cron expressions. The storage providers page let you configure S3 buckets. Everything looked like it worked. The only missing piece was the connection between the button and the engine -- a tokio::spawn call that nobody wrote.

This is a pattern we see in every complex system: the glue code is invisible in architecture diagrams but essential in production. The pipeline stages were individually correct. The API handler was correct (it returned 202). The scheduler was correct (it had tests). But the integration -- the single line that calls engine.execute_existing_backup() -- was missing.

Integration testing would have caught this. A single end-to-end test that triggers a backup and asserts the final status is "completed" would have failed immediately. We had unit tests for the scheduler's processing guard and the volume archiver's path traversal protection. We did not have a test that clicked the button and checked if a file appeared on disk.
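The core of such a test is a poll-until-status loop. Here is a minimal, dependency-free sketch; the status-fetching closure is a stand-in for a real API call against the backup endpoint.

```rust
use std::time::{Duration, Instant};

// Poll `fetch` until it returns `want` or the timeout expires.
// In a real integration test, `fetch` would GET the backup's
// status from the API.
fn wait_for_status<F>(mut fetch: F, want: &str, timeout: Duration) -> bool
where
    F: FnMut() -> String,
{
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        if fetch() == want {
            return true;
        }
        std::thread::sleep(Duration::from_millis(50));
    }
    false
}

fn main() {
    // Simulate a backup that flips to "completed" on the third poll.
    let mut polls = 0;
    let ok = wait_for_status(
        || {
            polls += 1;
            if polls >= 3 { "completed".into() } else { "running".into() }
        },
        "completed",
        Duration::from_secs(2),
    );
    assert!(ok);
    println!("ok");
}
```

Run against the original code, this loop would have spun until the timeout and failed the test: the status never leaves "pending."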

The audit methodology (build, audit, audit, approve) caught architecture and security issues. But the auditors read code, not runtime behavior. The handler looked reasonable -- it created a record and returned 202. The auditor had no reason to search for a missing tokio::spawn in a function that appeared complete.

The real fix is not just the code. It is adding integration tests that verify the end-to-end flow: trigger a backup, wait for completion, verify the file exists, download it, and confirm the contents match. Those tests are next.

Next in the series: When pg_dump Cannot Find Your Database -- how template-deployed databases store credentials, why app.name is not db_name, and the env var decryption pipeline that connects the backup engine to the credentials system.
