There is a particular class of infrastructure failure that no monitoring dashboard can prevent: the one where your data simply ceases to exist. A corrupt disk. A botched migration. A DROP TABLE executed against production at 2 AM by someone who thought they were in staging. The only defense is backups, and backups are only as good as the storage they land on and the encryption that protects them.
We built a backup engine that could dump PostgreSQL, MySQL, and MongoDB databases, archive Docker volumes, compress everything with gzip, encrypt it with AES-256-GCM, and ship it to any of 13 storage providers -- from a local directory to AWS S3 to a Hetzner Storage Box over FTPS. Then we discovered that FTP over IPv6 is broken in ways that required us to write our own client.
The Backup Pipeline
The backup engine followed a linear pipeline: dump, compress, encrypt, store, record. Each stage was a separate module with a single responsibility:
```
Database/Volume -> dump -> compress (gzip) -> encrypt (AES-256-GCM) -> store -> DB record
```

The BackupEngine orchestrator called each stage in sequence. If any stage failed, the pipeline halted and the backup record was marked as failed with the error message. No partial backups were left on storage. No unencrypted data was written to disk.
```rust
pub async fn execute_backup(
    &self,
    source: &BackupSource,
    destination: &BackupDestination,
    master_key: &MasterKey,
) -> Result<BackupRecord> {
    let raw_data = match source {
        BackupSource::Database { app_id, db_type } =>
            self.dump.execute(app_id, db_type).await?,
        BackupSource::Volume { app_id, volume } =>
            self.volume.archive(app_id, volume).await?,
    };

    let compressed = self.compress.gzip(&raw_data)?;
    let encrypted = self.encryption.encrypt(&compressed, master_key)?;
    let path = self.storage.upload(&encrypted, destination).await?;

    Ok(BackupRecord {
        source: source.clone(),
        destination: destination.clone(),
        path,
        size_bytes: encrypted.len() as i64,
        encrypted: true,
        created_at: Utc::now(),
    })
}
```
Database Dumps via Docker Exec
For database backups, we did not install PostgreSQL or MySQL client binaries on the host. Instead, we executed dump commands inside the running database containers using the Docker exec API:
```rust
// PostgreSQL: pg_dump via Docker exec
let output = docker.exec_in_container(
    container_id,
    &["pg_dump", "-U", &user, "-d", &database, "--format=custom"],
).await?;

// MySQL: mysqldump via Docker exec
let output = docker.exec_in_container(
    container_id,
    &["mysqldump", "-u", &user, &format!("-p{}", password), &database],
).await?;

// MongoDB: mongodump with --archive for single-stream output
let output = docker.exec_in_container(
    container_id,
    &["mongodump", "--archive", "--db", &database],
).await?;
```
This approach had two advantages. First, no host dependencies -- the backup engine worked on any server without installing database clients. Second, version matching -- the dump tool inside the container always matched the database version, eliminating the common problem where a pg_dump version mismatch produces corrupted backups.
Volume backups used tar to create compressed archives of entire Docker volume mount points. The archive was streamed into the same pipeline: compress, encrypt, store.
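The tar invocation itself is small enough to sketch. This is a hypothetical helper (the exact argument layout is an assumption, not the project's actual code) showing how a volume mount point can be streamed to stdout so the archive feeds straight into the compress, encrypt, store stages:

```rust
/// Build the argv for archiving a Docker volume's mount point.
/// Writing the archive to stdout ("-f -") lets the caller stream it
/// into the rest of the pipeline without a temp file on disk.
/// NOTE: illustrative helper; the real project's flags may differ.
fn tar_archive_args(mount_point: &str) -> Vec<String> {
    ["tar", "-c", "-f", "-", "-C", mount_point, "."]
        .iter()
        .map(|s| s.to_string())
        .collect()
}

fn main() {
    let args = tar_archive_args("/var/lib/docker/volumes/app_data/_data");
    println!("{}", args.join(" "));
}
```

Changing into the mount point with `-C` before archiving keeps paths inside the archive relative, which makes restores independent of where the volume happens to be mounted.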
AES-256-GCM: Encryption That Cannot Be Downgraded
Every backup was encrypted before leaving the server. We chose AES-256-GCM because it provides both confidentiality and integrity verification in a single operation. If a backup file is tampered with on storage, the decryption will fail rather than silently produce corrupted data.
For large backups, we implemented chunked encryption with 4 MB chunks, each with its own randomly generated nonce:
```rust
pub async fn upload_encrypted(
    data: &[u8],
    master_key: &MasterKey,
    backend: &StorageBackend,
    path: &str,
) -> Result<()> {
    const CHUNK_SIZE: usize = 4 * 1024 * 1024; // 4 MB

    let mut offset = 0;
    let mut chunk_index: u32 = 0;

    while offset < data.len() {
        let end = (offset + CHUNK_SIZE).min(data.len());
        let chunk = &data[offset..end];

        // Per-chunk nonce prevents nonce reuse across chunks
        let nonce = generate_nonce();
        let encrypted_chunk = aes_256_gcm_encrypt(chunk, master_key, &nonce)?;

        let chunk_path = format!("{}.chunk_{:06}", path, chunk_index);
        backend.write(&chunk_path, &encrypted_chunk).await?;

        offset = end;
        chunk_index += 1;
    }

    Ok(())
}
```
Per-chunk nonces were essential. AES-GCM is catastrophically broken if the same nonce is reused with the same key. By generating a fresh random nonce for each 4 MB chunk, we eliminated the risk entirely, even for multi-gigabyte backups.
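One practical question this leaves open is where each chunk's nonce lives. A common convention (an assumption here, not necessarily what the project did) is to prepend the 12-byte nonce to its ciphertext, so the decryptor can recover it from the chunk itself with no side channel:

```rust
const NONCE_LEN: usize = 12; // 96-bit nonce, the standard size for AES-GCM

/// Frame one encrypted chunk as [nonce || ciphertext].
/// Hypothetical wire format for illustration.
fn frame_chunk(nonce: &[u8; NONCE_LEN], ciphertext: &[u8]) -> Vec<u8> {
    let mut framed = Vec::with_capacity(NONCE_LEN + ciphertext.len());
    framed.extend_from_slice(nonce);
    framed.extend_from_slice(ciphertext);
    framed
}

/// Split a framed chunk back into (nonce, ciphertext).
/// Returns None if the chunk is too short to contain a nonce.
fn parse_chunk(framed: &[u8]) -> Option<(&[u8], &[u8])> {
    if framed.len() < NONCE_LEN {
        return None;
    }
    Some(framed.split_at(NONCE_LEN))
}

fn main() {
    let nonce = [7u8; NONCE_LEN];
    let framed = frame_chunk(&nonce, b"ciphertext bytes");
    let (recovered_nonce, ciphertext) = parse_chunk(&framed).unwrap();
    assert_eq!(recovered_nonce, &nonce[..]);
    assert_eq!(ciphertext, &b"ciphertext bytes"[..]);
    println!("round-trip ok: {} byte frame", framed.len());
}
```

The nonce is not secret, only unique, so storing it in the clear next to the ciphertext costs nothing; GCM's authentication tag still catches any tampering with either part.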
13 Storage Providers via OpenDAL
The backup engine needed to support diverse storage backends. Some users want S3. Some want Backblaze B2 for the cost savings. Some have a Hetzner Storage Box they are already paying for. Some want to keep backups on the same server in a different directory.
We used Apache OpenDAL as a unified abstraction layer. OpenDAL provides a single Operator interface for reading, writing, listing, and deleting files across dozens of storage backends:
```rust
pub fn build_operator(config: &StorageConfig) -> Result<Operator> {
    match config {
        StorageConfig::Local { path } => {
            let builder = Fs::default().root(path);
            Ok(Operator::new(builder)?.finish())
        }
        StorageConfig::S3 { provider, bucket, region, access_key, secret_key, account_id, .. } => {
            let mut builder = S3::default()
                .bucket(bucket)
                .region(region)
                .access_key_id(access_key)
                .secret_access_key(secret_key);

            // Provider-specific endpoint overrides
            match provider {
                S3Provider::Cloudflare => builder = builder
                    .endpoint(&format!("https://{}.r2.cloudflarestorage.com", account_id)),
                S3Provider::DigitalOcean => builder = builder
                    .endpoint(&format!("https://{}.digitaloceanspaces.com", region)),
                S3Provider::Backblaze => builder = builder
                    .endpoint(&format!("https://s3.{}.backblazeb2.com", region)),
                S3Provider::Wasabi => builder = builder
                    .endpoint(&format!("https://s3.{}.wasabisys.com", region)),
                S3Provider::Hetzner => builder = builder
                    .endpoint("https://fsn1.your-objectstorage.com"),
                S3Provider::Aws | S3Provider::MinIO | S3Provider::Generic => {}
            }

            Ok(Operator::new(builder)?.finish())
        }
        StorageConfig::Sftp { host, port, username, .. } => {
            let builder = Sftp::default()
                .endpoint(&format!("{}:{}", host, port))
                .user(username);
            Ok(Operator::new(builder)?.finish())
        }
        // FTP and FTPS handled separately (see below)
        _ => Err(anyhow!("Unsupported provider")),
    }
}
```
The 13 providers fell into three categories:
| Category | Providers |
|---|---|
| Local | Local filesystem |
| S3-compatible | AWS S3, Cloudflare R2, DigitalOcean Spaces, Backblaze B2, Wasabi, MinIO, Hetzner Object Storage, Generic S3 |
| File transfer | SFTP, FTP, FTPS |
| Cloud drives | Dropbox, Google Drive |
Each provider had its own default endpoint and region configuration. The S3Provider enum encoded the differences so that users only needed to provide their bucket name and credentials -- the endpoint URL was derived automatically.
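Endpoint derivation like this is worth isolating into a pure function, which also makes it trivially testable. This sketch reuses the endpoint patterns from the match above; the function name and shape are illustrative, not the project's actual API:

```rust
#[derive(Debug)]
enum S3Provider {
    Aws, Cloudflare, DigitalOcean, Backblaze, Wasabi, Hetzner, MinIO, Generic,
}

/// Derive the default endpoint URL for a provider, if one exists.
/// `account_id` is only meaningful for Cloudflare R2.
/// Illustrative sketch; the real code set endpoints on the OpenDAL builder.
fn default_endpoint(provider: &S3Provider, region: &str, account_id: &str) -> Option<String> {
    match provider {
        S3Provider::Cloudflare => Some(format!("https://{}.r2.cloudflarestorage.com", account_id)),
        S3Provider::DigitalOcean => Some(format!("https://{}.digitaloceanspaces.com", region)),
        S3Provider::Backblaze => Some(format!("https://s3.{}.backblazeb2.com", region)),
        S3Provider::Wasabi => Some(format!("https://s3.{}.wasabisys.com", region)),
        S3Provider::Hetzner => Some("https://fsn1.your-objectstorage.com".to_string()),
        // AWS, MinIO, and Generic rely on user-supplied or SDK defaults.
        S3Provider::Aws | S3Provider::MinIO | S3Provider::Generic => None,
    }
}

fn main() {
    let ep = default_endpoint(&S3Provider::Backblaze, "eu-central-003", "");
    println!("{}", ep.unwrap());
}
```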
The FTP Nightmare: IPv6, PASV, and TLS SNI
Everything worked beautifully until we tested FTP uploads to a Hetzner Storage Box. The connection failed with a cryptic error:
```
421 Could not listen for passive connection: invalid passive IP "[2a01"
```

The root cause was an intersection of three problems:
Problem 1: IPv6 and PASV. The Hetzner Storage Box DNS resolved to an IPv6 address. OpenDAL's FTP backend used the PASV command, which is IPv4-only. The server tried to return an IPv6 address in PASV format, which truncated it at the first colon, producing the garbage [2a01.
Problem 2: No EPSV support. The fix for PASV on IPv6 is EPSV (Extended Passive Mode). But OpenDAL's FTP backend did not expose a way to enable EPSV. The suppaftp library underneath supported it, but OpenDAL's abstraction layer did not pass the option through.
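The difference between the two modes is visible in the protocol itself. A PASV reply embeds a dotted-quad IPv4 address (`227 Entering Passive Mode (h1,h2,h3,h4,p1,p2)`), which has no room for an IPv6 address; an EPSV reply carries only a port (`229 Entering Extended Passive Mode (|||50000|)`) and reuses the control connection's address. A minimal parser for the EPSV reply (illustrative, not suppaftp's code):

```rust
/// Extract the data port from an EPSV reply such as
/// "229 Entering Extended Passive Mode (|||50000|)".
/// The reply deliberately omits any IP address: the client reuses the
/// address it already used for the control connection, so IPv6 just works.
fn parse_epsv_port(reply: &str) -> Option<u16> {
    let start = reply.find("(|||")? + 4;
    let end = reply[start..].find('|')? + start;
    reply[start..end].parse().ok()
}

fn main() {
    let reply = "229 Entering Extended Passive Mode (|||50000|)";
    println!("{:?}", parse_epsv_port(reply)); // → Some(50000)
}
```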
Problem 3: TLS SNI hostname. Even if we forced IPv4 resolution, OpenDAL used the same string for both the TCP connection address and the TLS Server Name Indication (SNI) hostname. If we resolved the hostname to an IPv4 address and passed the IP directly, TLS certificate verification would fail because the certificate was issued for u563760.your-storagebox.de, not for 123.45.67.89.
Transmit (a macOS FTP client) worked fine because it used EPSV by default and handled SNI correctly. Our code could not because OpenDAL's abstraction prevented us from controlling these low-level details.
The Solution: Our Own FTP Client
We bypassed OpenDAL entirely for FTP and FTPS. Using the suppaftp library directly, we built a dedicated FTP client that handled all three problems:
```rust
pub struct FtpClient {
    host: String,
    port: u16,
    username: String,
    password: String,
    use_tls: bool,
}

impl FtpClient {
    pub async fn connect(&self) -> Result<AsyncNativeTlsFtpStream> {
        let mut stream = AsyncNativeTlsFtpStream::connect(
            format!("{}:{}", self.host, self.port),
        ).await?;

        if self.use_tls {
            // TLS with correct SNI hostname
            stream = stream.into_secure(
                AsyncNativeTlsConnector::from(TlsConnector::new()?),
                &self.host, // Original hostname for SNI
            ).await?;
        }

        stream.login(&self.username, &self.password).await?;

        // EPSV mode -- works with both IPv4 and IPv6
        stream.set_mode(Mode::ExtendedPassive);

        Ok(stream)
    }
}
```
The StorageBackend was refactored with an internal Engine enum that routed operations to either OpenDAL or the custom FTP client:
```rust
enum Engine {
    OpenDal(Operator),
    Ftp(FtpClient),
}

impl StorageBackend {
    pub async fn write(&self, path: &str, data: &[u8]) -> Result<()> {
        match &self.engine {
            Engine::OpenDal(op) => op.write(path, data.to_vec()).await?,
            Engine::Ftp(client) => client.write(path, data).await?,
        }
        Ok(())
    }
}
```
The dashboard was also updated: when the user switched the storage provider type to FTP, the default port field changed from 22 (SFTP) to 21 (FTP). A small detail, but one that prevented a guaranteed connection failure for every FTP user who did not notice the wrong default.
Scheduling and Retention
Backups without a schedule are backups that do not happen. The BackupScheduler used cron expressions to trigger backups at user-configured intervals:
```
0 2 * * *    -- daily at 2 AM
0 */6 * * *  -- every 6 hours
0 0 * * 0    -- weekly on Sunday
```
Each schedule had a retention count. When a new backup completed, the retention pruner listed existing backups for that schedule, sorted by timestamp, and deleted the oldest ones that exceeded the limit. If you configured "keep last 7 daily backups," the eighth backup would trigger deletion of the first.
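The pruning rule reduces to a slice operation over a timestamp-sorted list. A std-only sketch with simplified types (the real pruner worked against storage listings and database records):

```rust
/// Given backup IDs sorted oldest-first by timestamp, return the IDs
/// that fall outside the retention window and should be deleted.
/// Illustrative helper; names and types are assumptions.
fn backups_to_prune(sorted_oldest_first: &[u64], keep_last: usize) -> Vec<u64> {
    if sorted_oldest_first.len() <= keep_last {
        return Vec::new();
    }
    // Everything before the last `keep_last` entries is pruned.
    sorted_oldest_first[..sorted_oldest_first.len() - keep_last].to_vec()
}

fn main() {
    // "Keep last 7": the eighth backup pushes out the first.
    let ids: Vec<u64> = (1..=8).collect();
    println!("{:?}", backups_to_prune(&ids, 7)); // → [1]
}
```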
The scheduler ran as a background task with next-run tracking. On each tick, it checked which schedules were due, executed their backups, and computed the next run time. Backup records were stored in the database with status tracking (pending, running, completed, failed) so the dashboard could show real-time progress.
The Storage Provider API
Storage providers were managed through a full CRUD API with encrypted configuration storage. When a user added an S3 provider, the access key and secret key were encrypted with the instance's master key before being stored in the database:
```
POST   /api/v1/storage-providers              -- Create provider
POST   /api/v1/storage-providers/test         -- Test connection
GET    /api/v1/storage-providers              -- List providers
GET    /api/v1/storage-providers/:id          -- Get provider
PATCH  /api/v1/storage-providers/:id          -- Update provider
DELETE /api/v1/storage-providers/:id          -- Delete provider
POST   /api/v1/storage-providers/:id/default  -- Set as default
```

The test endpoint performed a write/read/delete probe -- uploading a small test file, reading it back, and deleting it. This verified not just connectivity but actual read/write permissions. The response DTO never exposed config_encrypted, ensuring that credentials were write-only through the API.
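The probe is a small sequence over three operations. Sketched here against a hypothetical synchronous `Backend` trait with an in-memory backend for demonstration (the real code ran async against StorageBackend):

```rust
use std::collections::HashMap;

/// Minimal stand-in for a storage backend; the real trait was async.
trait Backend {
    fn write(&mut self, path: &str, data: &[u8]) -> Result<(), String>;
    fn read(&self, path: &str) -> Result<Vec<u8>, String>;
    fn delete(&mut self, path: &str) -> Result<(), String>;
}

/// Verify read/write/delete permissions, not just connectivity.
/// The probe file name is a hypothetical choice for illustration.
fn test_connection(backend: &mut dyn Backend) -> Result<(), String> {
    let path = ".sh0-connection-test";
    let payload: &[u8] = b"probe";
    backend.write(path, payload)?;
    let echoed = backend.read(path)?;
    if echoed != payload {
        return Err("read-back mismatch".to_string());
    }
    backend.delete(path)
}

/// In-memory backend for demonstration.
struct MemBackend(HashMap<String, Vec<u8>>);

impl Backend for MemBackend {
    fn write(&mut self, path: &str, data: &[u8]) -> Result<(), String> {
        self.0.insert(path.to_string(), data.to_vec());
        Ok(())
    }
    fn read(&self, path: &str) -> Result<Vec<u8>, String> {
        self.0.get(path).cloned().ok_or_else(|| "not found".to_string())
    }
    fn delete(&mut self, path: &str) -> Result<(), String> {
        self.0.remove(path).map(|_| ()).ok_or_else(|| "not found".to_string())
    }
}

fn main() {
    let mut backend = MemBackend(HashMap::new());
    println!("{:?}", test_connection(&mut backend)); // → Ok(())
}
```

Deleting the probe file at the end matters: it leaves no residue on the user's storage, and a failed delete surfaces a missing permission that a write-only check would never catch.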
The Final Count
The backup system spanned three crates (sh0-backup, sh0-db, sh0-api), one database migration, 33 passing tests, and a dashboard page with provider cards, test connection buttons, and default provider selection. It supported 13 storage providers, AES-256-GCM encryption with chunked uploads, cron-based scheduling with retention pruning, and database dumps for three database engines.
And it had a custom FTP client -- because sometimes, the only way to make something work is to go around the abstraction that is supposed to make it easy.
---
Next in the series: Autoscaling in Rust: CPU Thresholds, Cooldowns, and Load Balancing -- how we built horizontal scaling with replica management, Caddy load balancing, and an autoscaler that watches CPU and memory metrics.