File management in a web application creates a problem that most frameworks ignore: orphaned files. A user uploads a profile photo, then changes it. The old photo is still on disk, referenced by nothing. A product listing is deleted, but its associated images remain. Over time, these orphaned files accumulate, wasting disk space and potentially exposing sensitive data that should have been deleted.
Session 237 completed FLIN's garbage collection system by integrating it with the CLI (for manual sweeps) and the HTTP server (for automatic reference tracking). This was the final piece of the FM-7 milestone -- compression and garbage collection -- bringing it to 100% completion across all 8 tasks.
The Blob Reference Problem
FLIN's file upload system stores files as blobs -- binary large objects -- in configurable storage backends (local filesystem, S3, R2, GCS). Each blob has a unique identifier (a hash of its content plus a timestamp). When a FLIN entity references a file, the entity's field stores the blob identifier.
The problem arises when references change. Consider this sequence:
1. User uploads photo.jpg -- stored as blob abc123, referenced by User#7.avatar.
2. User uploads new-photo.jpg -- stored as blob def456, User#7.avatar now references def456.
3. Blob abc123 is no longer referenced by any entity. It is orphaned.
Without garbage collection, blob abc123 lives on disk forever. With thousands of users updating their profiles, the storage cost of orphaned blobs grows continuously.
Reference Tracking
The garbage collection system maintains a reference index: a mapping from blob identifiers to the entities that reference them. When an entity is saved, its file fields are scanned and the references are recorded. When an entity is destroyed, its references are removed.
```rust
pub struct BlobRefIndex {
    refs: HashMap<String, Vec<BlobRef>>, // blob_id -> list of references
    orphans: HashMap<String, Instant>,   // blob_id -> time it became orphaned
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct BlobRef {
    entity_type: String,
    entity_id: u64,
    field_name: String,
}

impl BlobRefIndex {
    pub fn add_ref(&mut self, blob_id: &str, entity_ref: BlobRef) {
        self.refs
            .entry(blob_id.to_string())
            .or_default()
            .push(entity_ref);

        // If this blob was previously orphaned, un-orphan it
        self.orphans.remove(blob_id);
    }

    pub fn remove_ref(&mut self, blob_id: &str, entity_type: &str, entity_id: u64) {
        if let Some(refs) = self.refs.get_mut(blob_id) {
            refs.retain(|r| !(r.entity_type == entity_type && r.entity_id == entity_id));

            if refs.is_empty() {
                self.refs.remove(blob_id);
                // Mark as orphaned with the current timestamp
                self.orphans.insert(blob_id.to_string(), Instant::now());
            }
        }
    }

    pub fn find_orphans(&self, grace_period: Duration) -> Vec<&str> {
        self.orphans
            .iter()
            .filter(|(_, orphaned_at)| orphaned_at.elapsed() > grace_period)
            .map(|(blob_id, _)| blob_id.as_str())
            .collect()
    }
}
```
The reference index is persisted to disk as .flindb/blob_refs.json. It is loaded when the VM starts and saved on every entity save, destroy, and checkpoint operation. This ensures that reference tracking survives server restarts.
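To make the load-on-start, save-on-mutation cycle concrete, here is a minimal persistence sketch. The real index is serialized as JSON; to stay dependency-free this sketch uses a trivial tab-separated format and tracks only reference counts. The function names `save_index` and `load_index` are illustrative, not FLIN's actual API.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

// Write one "blob_id<TAB>ref_count" pair per line.
fn save_index(refs: &HashMap<String, u32>, path: &Path) -> io::Result<()> {
    let mut out = String::new();
    for (blob_id, count) in refs {
        out.push_str(&format!("{}\t{}\n", blob_id, count));
    }
    fs::write(path, out)
}

// Parse the file back into a map; malformed lines are skipped.
fn load_index(path: &Path) -> io::Result<HashMap<String, u32>> {
    let mut refs: HashMap<String, u32> = HashMap::new();
    for line in fs::read_to_string(path)?.lines() {
        if let Some((id, count)) = line.split_once('\t') {
            if let Ok(n) = count.parse() {
                refs.insert(id.to_string(), n);
            }
        }
    }
    Ok(refs)
}
```

In production a format like JSON plus an atomic write-then-rename would be preferable, but the round-trip shape is the same: mutate in memory, flush to disk, reload on startup.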
HTTP Integration: Automatic Tracking
The VM's save and destroy opcodes now automatically update the blob reference index. When an entity is saved, the VM scans its fields for file references and adds them to the index. When an entity is destroyed, the references are removed.
```rust
// In VM::execute_save()
fn execute_save(&mut self) -> Result<(), RuntimeError> {
    let entity = self.pop_entity()?;
    let id = self.storage.save(&entity)?;

    // Track blob references
    if let Some(ref mut blob_index) = self.blob_ref_index {
        let file_fields = self.schema.get_file_fields(&entity.entity_type);

        for field in file_fields {
            if let Some(Value::Text(blob_id)) = entity.get(&field.name) {
                blob_index.add_ref(blob_id, BlobRef {
                    entity_type: entity.entity_type.clone(),
                    entity_id: id,
                    field_name: field.name.clone(),
                });
            }
        }

        blob_index.save_to_disk()?;
    }

    self.push(Value::Int(id as i64));
    Ok(())
}

// In VM::execute_destroy()
fn execute_destroy(&mut self) -> Result<(), RuntimeError> {
    let entity = self.pop_entity()?;

    // Get file paths BEFORE destroying the entity
    let file_paths = self
        .storage
        .destroy_with_cleanup(&entity.entity_type, entity.id)?;

    // Remove blob references
    if let Some(ref mut blob_index) = self.blob_ref_index {
        for blob_id in &file_paths {
            blob_index.remove_ref(blob_id, &entity.entity_type, entity.id);
        }

        blob_index.save_to_disk()?;
    }

    self.push(Value::Bool(true));
    Ok(())
}
```
The critical detail in execute_destroy is that file paths are extracted before the entity is deleted. If we extracted them after, the entity would already be gone and the file references would be unrecoverable. The destroy_with_cleanup method returns the list of blob identifiers that were referenced by the destroyed entity.
The Grace Period
Orphaned blobs are not deleted immediately. They enter a grace period (default: 1 hour) during which they can be re-referenced. This handles a common pattern: a user uploads a new avatar, the old avatar becomes orphaned, but then the user clicks "undo" and the old avatar is restored. Without the grace period, the old avatar would already be deleted.
The grace period also provides a safety net for transient failures. If the reference index fails to update during a save operation (disk full, for example), the blob is temporarily orphaned. The grace period gives the system time to recover and update the index correctly.
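The rescue behavior described above can be sketched with a few lines of std-only Rust. This is a reduced model of the orphan bookkeeping, not FLIN's actual types: the struct name `Orphans` and the methods `mark`, `rescue`, and `sweepable` are illustrative.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Minimal model of the grace-period bookkeeping.
struct Orphans {
    orphaned_at: HashMap<String, Instant>,
}

impl Orphans {
    // A blob loses its last reference: record when it became orphaned.
    fn mark(&mut self, blob_id: &str) {
        self.orphaned_at.insert(blob_id.to_string(), Instant::now());
    }

    // A blob is re-referenced within the grace period: it is rescued.
    fn rescue(&mut self, blob_id: &str) {
        self.orphaned_at.remove(blob_id);
    }

    // Only blobs orphaned for longer than the grace period are sweepable.
    fn sweepable(&self, grace: Duration) -> Vec<&str> {
        self.orphaned_at
            .iter()
            .filter(|(_, t)| t.elapsed() > grace)
            .map(|(id, _)| id.as_str())
            .collect()
    }
}
```

The "undo" scenario is just `mark` followed by `rescue` before the grace period expires: the blob never appears in `sweepable`, so it is never deleted.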
CLI Integration: The flin gc Command
Session 237 added the flin gc command for manual garbage collection. In production, operators need to inspect the state of blob storage, preview what would be deleted, and execute sweeps on their own schedule.
```shell
# Show GC status: total blobs, referenced, orphaned, reclaimable space
$ flin gc
Blob Storage Status:
  Total blobs: 1,247
  Referenced:  1,189
  Orphaned:    58
  Reclaimable: 234 MB
  Grace period: 1 hour
  Orphans past grace period: 41

# Preview what would be deleted (dry run)
$ flin gc --sweep --dry-run
Would delete 41 orphaned blobs (198 MB):
  abc123.jpg (4.2 MB, orphaned 3h ago)
  def456.pdf (12.1 MB, orphaned 2h ago)
  ghi789.png (1.8 MB, orphaned 5h ago)
  ... (38 more)

# Execute the sweep
$ flin gc --sweep
Deleted 41 orphaned blobs, reclaimed 198 MB

# Custom grace period (5 minutes instead of 1 hour)
$ flin gc --sweep --grace-period 300

# Verbose output: show each blob being deleted
$ flin gc --sweep -v
```
The --dry-run flag is essential for production use: it lets operators verify exactly what will be deleted before any data is permanently removed. The verbose flag (-v) lists each individual blob, which is useful when debugging why a specific file was -- or was not -- scheduled for deletion.
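The dry-run/execute split can be sketched as a single sweep loop that branches on the flag. This is an illustrative sketch, not FLIN's actual handler; the `sweep` function and its tuple-based orphan list are assumptions for the example.

```rust
// Sketch of a dry-run-aware sweep over a list of (blob_id, size_in_bytes)
// orphans. Returns how many blobs were (or would be) deleted and the total
// bytes reclaimed (or reclaimable).
fn sweep(orphans: &[(String, u64)], dry_run: bool) -> (usize, u64) {
    let mut count = 0usize;
    let mut bytes = 0u64;
    for (blob_id, size) in orphans {
        if dry_run {
            println!("Would delete {} ({} bytes)", blob_id, size);
        } else {
            // In the real handler, the storage backend's delete call goes here.
            println!("Deleted {} ({} bytes)", blob_id, size);
        }
        count += 1;
        bytes += *size;
    }
    (count, bytes)
}
```

Keeping one code path for both modes guarantees the dry run previews exactly the set of blobs a real sweep would remove.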
The implementation in src/main.rs added the Gc variant to the Commands enum with the following options:
```rust
Commands::Gc {
    path: PathBuf,
    sweep: bool,
    dry_run: bool,
    grace_period: u64,
    verbose: bool,
}
```

The cmd_gc() handler function (approximately 150 lines) loads the blob reference index, scans the storage backend for all blobs, computes the orphan list, and either displays the status or executes the sweep.
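The status numbers the CLI prints fall out of one pass over the blob list. Here is a hedged sketch of that computation; the `GcStatus` struct and `gc_status` function are illustrative names, not FLIN's actual internals.

```rust
use std::collections::HashSet;

// The four numbers the `flin gc` status display reports.
struct GcStatus {
    total: usize,
    referenced: usize,
    orphaned: usize,
    reclaimable_bytes: u64,
}

// Given every blob in storage (id, size) and the set of blob ids that
// appear in the reference index, derive the status counts.
fn gc_status(all_blobs: &[(String, u64)], referenced: &HashSet<String>) -> GcStatus {
    let mut orphaned = 0usize;
    let mut reclaimable = 0u64;
    for (id, size) in all_blobs {
        if !referenced.contains(id) {
            orphaned += 1;
            reclaimable += *size;
        }
    }
    GcStatus {
        total: all_blobs.len(),
        referenced: all_blobs.len() - orphaned,
        orphaned,
        reclaimable_bytes: reclaimable,
    }
}
```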
The Format Bytes Helper
A small but important detail: the CLI output formats byte sizes in human-readable units. Nobody wants to see "198,234,567 bytes" -- they want to see "198 MB."
```rust
fn format_bytes(bytes: u64) -> String {
    const KB: u64 = 1024;
    const MB: u64 = KB * 1024;
    const GB: u64 = MB * 1024;

    if bytes >= GB {
        format!("{:.1} GB", bytes as f64 / GB as f64)
    } else if bytes >= MB {
        format!("{:.1} MB", bytes as f64 / MB as f64)
    } else if bytes >= KB {
        format!("{:.1} KB", bytes as f64 / KB as f64)
    } else {
        format!("{} B", bytes)
    }
}
```
Small utility functions like this are the difference between a developer tool and a production tool. Production tools are used by operators who need to make decisions quickly, and clear output formatting enables that.
Checkpoint Integration
The blob reference index is also saved during database checkpoints. This ensures consistency between the database state and the reference index:
```rust
// In VM::checkpoint()
fn checkpoint(&mut self) -> Result<(), RuntimeError> {
    self.storage.checkpoint()?;

    if let Some(ref blob_index) = self.blob_ref_index {
        blob_index.save_to_disk()?;
    }

    Ok(())
}
```
If the server crashes between a save and a checkpoint, the WAL replay on restart will re-execute the save operations, which will re-add the blob references. The reference index might temporarily be out of date, but the grace period ensures no blobs are prematurely deleted.
Test Results
Session 237 added 10 new tests across two files:
CLI tests (6):
- Parsing of flin gc command with default options
- Parsing with --sweep flag
- Parsing with --dry-run flag
- Parsing with --grace-period custom value
- Parsing with -v verbose flag
- Parsing with all options combined
VM integration tests (4):
- VM with storage has a blob reference index
- VM without storage (in-memory) has no blob reference index
- Blob reference index persists across VM restarts
- Checkpoint saves the blob reference index
The total test count after Session 237 reached 3,537: 2,920 library tests and 617 integration tests. All passing.
FM-7 Milestone Complete
Session 237 completed the FM-7 milestone -- Compression and Garbage Collection -- at 100%:
| Task | Description | Status |
|---|---|---|
| FM7-01 | Zstd compression | Complete |
| FM7-02 | Compression CLI | Complete |
| FM7-03 | Compression statistics | Complete |
| FM7-04 | Decompression | Complete |
| FM7-05 | GC infrastructure | Complete |
| FM7-06 | GC orphan detection | Complete |
| FM7-07 | GC CLI integration | Complete |
| FM7-08 | HTTP GC integration | Complete |
Eight tasks, eight completions. The file management system -- from upload to storage to compression to garbage collection -- was fully operational. A FLIN application could accept file uploads, store them in any of four backends, compress them for efficient storage, track references automatically, and clean up orphaned files either manually via the CLI or automatically via the grace period mechanism.
This is the kind of infrastructure that most web frameworks leave to the developer. In FLIN, it is built in.
---
This is Part 188 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.
Series Navigation:
- [187] Search Result Caching
- [188] GC, CLI, and HTTP Integration Testing (you are here)
- [189] Tracking Sync and State Management