A database without transactions is a database that will lose data. Not might. Will. The moment two operations need to succeed or fail together -- transferring money between accounts, creating an order with its line items, registering a user and their profile -- you need atomicity. Without it, a crash between the two operations leaves the database in an impossible state: money debited but not credited, an order without items, a user without a profile.
A database without backups is a database waiting to be the subject of a post-mortem. Hardware fails. Humans make mistakes. Software has bugs. The question is not whether you will need to restore data, but when.
Session 166 was a marathon. Four major feature areas in a single session: ACID transactions, backup and restore, graph queries, and semantic search. Ninety-four tests added. This article covers the first two -- transactions and backup -- because they are the foundation that makes FlinDB production-ready.
ACID Transactions
FlinDB's transaction system provides the four ACID guarantees:
- Atomicity: All changes in a transaction succeed or none do
- Consistency: The database moves from one valid state to another
- Isolation: Concurrent transactions do not interfere
- Durability: Committed changes survive crashes
The Transaction Struct
Every transaction is a self-contained unit of work:
```
pub struct Transaction {
    id: TransactionId,
    started_at: i64,
    timeout_ms: Option<u64>,
    state: TransactionState,
    savepoints: Vec<Savepoint>,
    pending_saves: Vec<PendingSave>,
    pending_deletes: Vec<PendingDelete>,
    read_versions: HashMap<(String, u64), u64>,
}
```

The pending_saves and pending_deletes fields are the key to atomicity. During a transaction, no changes are applied to the main data store. Instead, they are accumulated in these pending lists. Only when commit() is called are all changes applied at once. If rollback() is called, the pending lists are discarded and the database remains unchanged.
The read_versions field enables optimistic concurrency control. When a transaction reads an entity, it records the entity's version number. At commit time, FlinDB checks whether any of these versions have changed. If another transaction modified an entity that this transaction read, the commit fails with a conflict error -- preventing lost updates.
Begin, Commit, Rollback
The transaction lifecycle is straightforward:
```
db.begin_transaction()

order = db.save("Order", { total: 0 })
item1 = db.save("OrderItem", { order_id: order.id, product: "Laptop" })
item2 = db.save("OrderItem", { order_id: order.id, product: "Mouse" })

db.commit()
```
If any save fails, the entire transaction can be rolled back:
```
db.begin_transaction()

try {
    db.save("Transfer", { from: account_a, to: account_b, amount: 1000 })
    db.save("Account", account_a.id, { balance: account_a.balance - 1000 })
    db.save("Account", account_b.id, { balance: account_b.balance + 1000 })
    db.commit()
} catch (e) {
    db.rollback()
}
```
The Rust implementation of commit applies all pending operations atomically:
```
pub fn commit_transaction(
    &mut self,
    txn_id: TransactionId,
) -> DatabaseResult<TransactionCommitResult> {
    let txn = self.transactions.remove(&txn_id)
        .ok_or(DatabaseError::TransactionNotFound)?;

    // Check for optimistic locking conflicts
    for ((entity_type, entity_id), read_version) in &txn.read_versions {
        let current_version = self.get_current_version(entity_type, *entity_id)?;
        if current_version != *read_version {
            return Err(DatabaseError::OptimisticLockConflict {
                entity_type: entity_type.clone(),
                entity_id: *entity_id,
            });
        }
    }

    // Record counts before the pending lists are consumed below
    let saves = txn.pending_saves.len();
    let deletes = txn.pending_deletes.len();

    // Apply all pending saves
    for save in txn.pending_saves {
        self.save(&save.entity_type, save.id, save.fields)?;
    }

    // Apply all pending deletes
    for delete in txn.pending_deletes {
        self.delete(&delete.entity_type, delete.id)?;
    }

    Ok(TransactionCommitResult { saves, deletes })
}
```
Savepoints
Savepoints allow partial rollback within a transaction. This is essential for complex workflows where you want to undo the last step without losing everything:
```
db.begin_transaction()

order = db.save("Order", { total: 0 })
db.create_savepoint("after_order")

try {
    db.save("OrderItem", { order_id: order.id, product: "Laptop" })
    db.commit()
} catch (e) {
    db.rollback_to_savepoint("after_order")
    db.save("Order", order.id, { status: "failed" })
    db.commit()
}
```
The rollback_to_savepoint() method discards all pending operations added after the savepoint was created, while keeping operations from before the savepoint.
Transaction Timeouts
Long-running transactions are dangerous. They hold resources, block other operations, and often indicate a programming error (a transaction that was begun but never committed). FlinDB supports configurable timeouts:
```
let txn = db.begin_transaction_with_timeout(5000); // 5 seconds
```

If the transaction is not committed or rolled back within the timeout period, it is automatically rolled back. This prevents resource leaks from forgotten transactions.
Backup and Restore
With transactions providing atomicity, the backup system ensures durability beyond a single process lifetime. FlinDB supports three backup strategies: full, incremental, and continuous.
Full Backup
A full backup captures the complete database state -- all schemas, all entities, all version history:
```
let options = BackupOptions::default();
Backup::full(&db, "backup.flindb.bak", options)?;
```

The backup file format uses Zstd compression:

```
.flindb.bak (Zstd compressed JSON)
+-- header: magic, version, type, timestamp, checksum
+-- metadata: schema count, entity count
+-- schemas: serialized EntitySchema[]
+-- entities: by type with full history
```

Why Zstd? We benchmarked three compression algorithms:
| Algorithm | Compression Ratio | Speed |
|---|---|---|
| Zstd (level 3) | 11% smaller than gzip | 42% faster than Brotli |
| Gzip | Baseline | Baseline |
| Brotli | 8% smaller than Zstd | 42% slower than Zstd |
Zstd offered the best trade-off: nearly the best compression ratio with significantly faster compression and decompression. For a backup system where both creation speed and restore speed matter, Zstd was the clear winner.
Every backup includes a SHA-256 checksum of the data payload. On restore, the checksum is verified before any data is applied. A corrupted backup file is rejected rather than silently loading garbled data.
Incremental Backup
Incremental backups capture only the changes since the last backup, using the WAL as the source of deltas:
```
let options = BackupOptions::incremental(last_backup_version);
Backup::incremental(&db, "backup_incr.flindb.bak", options)?;
```

Incremental backups are smaller and faster than full backups, making them suitable for frequent backup intervals. To restore, you apply the last full backup followed by all subsequent incremental backups in order.
Point-in-Time Recovery
FlinDB supports restoring to a specific timestamp:
```
let options = RestoreOptions::point_in_time(target_timestamp);
let db = Backup::restore("backup.flindb.bak", options)?;
```

Point-in-time recovery replays entity versions up to the specified timestamp, effectively rewinding the database to a past state. This is possible because FlinDB's temporal model preserves all versions -- the backup contains the complete history, and the restore process can stop at any point in that history.
Continuous Backup
Session 170 extended the backup system with continuous WAL streaming. Instead of periodic snapshots, the ContinuousBackup struct streams every WAL entry to a backup destination in real-time:
```
pub struct ContinuousBackup {
    source_wal: PathBuf,
    destination: BackupDestination,
    last_position: Arc<AtomicU64>,
    running: Arc<AtomicBool>,
    poll_interval: Duration,
}
```

The backup destination can be local or S3-compatible:
```
pub enum BackupDestination {
    Local(PathBuf),
    S3 {
        bucket: String,
        region: String,
        endpoint: Option<String>,
        prefix: String,
        access_key: String,
        secret_key: String,
    },
}
```

Continuous backup runs in a background thread, polling the WAL for new entries and streaming them to the destination:
```
let backup = ContinuousBackup::new(wal_path, BackupDestination::local(dest_path))
    .with_poll_interval(Duration::from_millis(50))
    .with_start_position(1000); // Resume from position

let handle = backup.start();
// Application runs...
backup.stop();
handle.join().unwrap();
```
The with_start_position() method enables resume capability. If the backup process is restarted, it picks up from where it left off rather than re-streaming the entire WAL. This is critical for production use where backup processes may be restarted during deployments.
For S3 destinations, entries are batched into 1 MB chunks before upload to minimize the number of S3 API calls and associated costs.
Backup Scheduling
The BackupScheduler automates periodic backups with retention policies:
```
let scheduler = BackupScheduler::new(
    Duration::from_secs(3600), // Every hour
    24,                        // Keep 24 backups
    "./backups",
)
.with_backup_type(BackupType::Full)
.with_compression(true);

let handle = scheduler.start(Arc::new(Mutex::new(db)));
```
The scheduler runs in a background thread, creating backups at the configured interval and enforcing the retention policy by deleting the oldest backups when the count exceeds the limit.
In FLIN configuration syntax, the backup setup is declarative:
```
app {
    backup: {
        enabled: true
        continuous: {
            destination: "local"
            path: "./backups/"
        }
        schedule: {
            interval: "1h"
            retention: 24
            type: "full"
            compression: true
        }
    }
}
```

The Test Suite
Transactions and backup together account for 55 tests across Sessions 166 and 170:
Transaction tests (12):

- Begin/commit/rollback lifecycle
- Savepoint creation and partial rollback
- Transaction timeout enforcement
- Optimistic locking conflict detection
- Commit result details

Backup tests (21 from Session 166):

- Full backup creation and verification
- Incremental backup creation
- Zstd compression roundtrip
- SHA-256 checksum verification
- Point-in-time recovery
- Restore configuration options

Continuous backup and scheduling tests (22 from Session 170):

- BackupDestination validation (local and S3)
- ContinuousBackup streaming and position tracking
- BackupScheduler creation, retention, and cleanup
- Resume capability after restart
Total tests after Session 170: 2,365 (1,748 library + 617 integration).
Why Both Transactions and Backups Matter
Transactions protect against application-level failures -- a crash during a multi-step operation. Backups protect against infrastructure-level failures -- disk corruption, accidental deletion, hardware failure.
Without transactions, a power loss during an order creation could leave an order without items. Without backups, a disk failure could lose all data permanently. Together, they form a complete durability story: transactions ensure consistency within a running system, and backups ensure recoverability when the system itself fails.
FlinDB provides both, with zero configuration required. Transactions are always available. The WAL provides crash recovery by default. And with a few lines of configuration, continuous backup and scheduled rotation ensure that data survives any failure.
---
This is Part 8 of the "How We Built FlinDB" series, documenting how we built a complete embedded database engine for the FLIN programming language.
Series Navigation:

- [061] Index Utilization: Making Queries Fast
- [062] Relationships and Eager/Lazy Loading
- [063] Transactions and Continuous Backup (you are here)
- [064] Graph Queries and Semantic Search
- [065] The EAVT Storage Model