File management is one of those features that seems simple until you start building it. Upload a file, save it somewhere, serve it back. Three steps. Except that "upload" means multipart parsing with RFC 2046 compliance, streaming for large files, SHA-256 integrity verification, path traversal protection, and size validation. "Save it somewhere" means local storage, cloud storage (S3, R2, GCS), content-addressable storage, compression, garbage collection, and preview generation. "Serve it back" means download grants, access control, range requests, and caching headers.
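To make one of those concerns concrete, here is a minimal sketch of path traversal protection for uploaded filenames. The function name and logic are illustrative, not FLIN's actual validator: it accepts a name only if it is a single normal path component, rejecting `..`, separators, and absolute paths.

```rust
use std::path::{Component, Path};

/// Reject upload filenames that could escape the storage root.
/// A hypothetical sketch -- not FLIN's actual implementation.
fn safe_filename(name: &str) -> Option<String> {
    let path = Path::new(name);
    let mut components = path.components();
    // Exactly one Normal component is allowed: no "..", no "/",
    // no absolute paths, no multi-segment names.
    match (components.next(), components.next()) {
        (Some(Component::Normal(part)), None) => {
            part.to_str().map(|s| s.to_string())
        }
        _ => None,
    }
}

fn main() {
    assert_eq!(safe_filename("report.pdf"), Some("report.pdf".to_string()));
    assert_eq!(safe_filename("../../etc/passwd"), None);
    assert_eq!(safe_filename("/etc/passwd"), None);
    assert_eq!(safe_filename("a/b.txt"), None);
}
```

The same shape of check generalizes to the other validators: each one turns a vague requirement ("protect against path traversal") into a small, testable predicate.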
FLIN's file management system was built across 30 sessions -- Sessions 212 through 243 -- between January 20 and January 22, 2026. It went from 3% complete to 100% complete in three days, ultimately comprising 75 tasks across 7 categories. And then it kept going, adding document parsing, semantic search over files, and RAG (Retrieval-Augmented Generation) integration.
The Audit: Session 212
Like the temporal marathon before it, the file storage work began with an audit. Session 212 examined the tracking document and discovered that it was significantly out of date. The tracking claimed 2/74 tasks (3%) complete. The actual state was 19/74 (26%).
```
File Management Status (Session 212)

FM-1: File Upload HTTP    Tracking: 0/12   Actual: 12/12 (100%)
FM-2: File Field Type     Tracking: 0/8    Actual: 5/8 (63%)

Total:                    Tracking: 2/74   Actual: 19/74 (26%)
```
The entire FM-1 category -- multipart parsing, file validators, save_file() -- had been implemented in Session 194 during the security sprint, but the tracking document was never updated. This was a recurring pattern: implementation outpacing documentation. After Session 212, the team committed to updating tracking in the same session as implementation.
Session 212 also revealed the scope of what remained. Multipart parsing and basic upload were done, but storage backends, document parsing, semantic search, compression, garbage collection, and preview generation all still lay ahead.
Sessions 213-218: Storage Backends
The first major task was building the storage backend abstraction. FLIN needed to support multiple storage destinations -- local filesystem for development, cloud object storage (S3, Cloudflare R2, Google Cloud Storage) for production -- through a single, unified interface.
```rust
// The StorageBackend trait (Session 214)
pub trait StorageBackend: Send + Sync {
    fn store(&self, key: &str, data: &[u8], metadata: &FileMetadata)
        -> Result<StorageResult, StorageError>;
    fn retrieve(&self, key: &str) -> Result<Vec<u8>, StorageError>;
    fn delete(&self, key: &str) -> Result<(), StorageError>;
    fn exists(&self, key: &str) -> Result<bool, StorageError>;
    fn metadata(&self, key: &str) -> Result<FileMetadata, StorageError>;
}
```
Session 214 defined the trait and implemented the local filesystem backend. Session 215 implemented the S3/R2 backend using presigned URLs for direct upload. Session 216 added Google Cloud Storage. Session 217 introduced download grants -- time-limited, signed URLs that allow controlled access to private files without exposing storage credentials.
```
// Download grants in FLIN (Session 217)
route GET "/files/:id/download" {
  guard auth

  file = File.find(params.id)
  grant = download_grant(file.storage_key, {
    expires_in: 3600,            // 1 hour
    content_disposition: "attachment",
    filename: file.original_name
  })

  redirect(grant.url)
}
```
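The core of a download grant is a storage key paired with an expiry time. The sketch below shows just that expiry mechanic in Rust; the names are hypothetical, and a real grant (like Session 217's) would also carry a cryptographic signature so it cannot be forged -- signing is omitted here.

```rust
use std::time::{Duration, SystemTime};

/// A time-limited download grant: a storage key plus an expiry instant.
/// Illustrative sketch only; the signature that makes grants tamper-proof
/// is deliberately left out.
struct DownloadGrant {
    storage_key: String,
    expires_at: SystemTime,
}

impl DownloadGrant {
    fn new(storage_key: &str, expires_in: Duration) -> Self {
        DownloadGrant {
            storage_key: storage_key.to_string(),
            expires_at: SystemTime::now() + expires_in,
        }
    }

    /// The URL a client would be redirected to (path shape is made up).
    fn url(&self) -> String {
        format!("/download/{}", self.storage_key)
    }

    /// A grant is only honored before its expiry time.
    fn is_valid(&self) -> bool {
        SystemTime::now() < self.expires_at
    }
}

fn main() {
    let grant = DownloadGrant::new("blobs/abc123", Duration::from_secs(3600));
    assert!(grant.is_valid());
    assert_eq!(grant.url(), "/download/blobs/abc123");

    let expired = DownloadGrant {
        storage_key: "blobs/abc123".into(),
        expires_at: SystemTime::now() - Duration::from_secs(1),
    };
    assert!(!expired.is_valid());
}
```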
Session 218 added access key management and cleanup utilities. By this point, FLIN supported local storage for development (zero configuration), S3-compatible storage for AWS and Cloudflare (one configuration block), and GCS for Google Cloud. Developers choose their backend with a single setting; the API is identical regardless of storage destination.
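The "one setting, identical API" property falls out of trait-object dispatch: a single configuration value selects a concrete backend, and all callers see only the shared interface. A minimal sketch, with made-up trait and type names standing in for FLIN's internals:

```rust
// Illustrative only: names here are not FLIN's actual API.
trait Backend {
    fn describe(&self) -> String;
}

struct LocalBackend { root: String }
struct S3Backend { bucket: String }

impl Backend for LocalBackend {
    fn describe(&self) -> String { format!("local:{}", self.root) }
}
impl Backend for S3Backend {
    fn describe(&self) -> String { format!("s3:{}", self.bucket) }
}

/// One configuration value picks the backend; callers only ever
/// hold a `Box<dyn Backend>` and never branch on the storage type.
fn backend_from_config(setting: &str) -> Box<dyn Backend> {
    match setting {
        "local" => Box::new(LocalBackend { root: "./storage".into() }),
        _ => Box::new(S3Backend { bucket: "uploads".into() }),
    }
}

fn main() {
    let b = backend_from_config("local");
    assert_eq!(b.describe(), "local:./storage");
    let b = backend_from_config("s3");
    assert_eq!(b.describe(), "s3:uploads");
}
```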
Sessions 219-227: Document Parsing and Semantic Search
This is where file storage became genuinely interesting. Storing and retrieving files is table stakes. Parsing documents, extracting text, chunking content, generating embeddings, and enabling semantic search over file contents -- that transforms files from opaque blobs into searchable knowledge.
Sessions 219-220 built document parsers for PDF and DOCX formats. Session 221 implemented intelligent text chunking -- splitting documents into overlapping segments that preserve context:
```rust
// Text chunking with overlap (Session 221)
pub struct ChunkConfig {
    max_chunk_size: usize,     // 512 tokens default
    overlap: usize,            // 50 tokens overlap
    separator: ChunkSeparator, // Paragraph, sentence, or fixed
}

pub fn chunk_text(text: &str, config: &ChunkConfig) -> Vec<TextChunk> {
    let segments = split_segments(text, &config.separator);
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut current_tokens = 0;

    for segment in segments {
        let segment_tokens = count_tokens(&segment);
        if current_tokens + segment_tokens > config.max_chunk_size {
            chunks.push(TextChunk {
                text: current.clone(),
                start_offset: chunks.len() * config.max_chunk_size,
                token_count: current_tokens,
            });
            // Keep overlap from end of current chunk
            current = get_last_n_tokens(&current, config.overlap);
            current_tokens = config.overlap;
        }
        current.push_str(&segment);
        current_tokens += segment_tokens;
    }
    // ... handle final chunk
    chunks
}
```
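The chunker above leans on FLIN internals (tokenizers, separators). A self-contained toy variant using word counts as a stand-in for token counts shows the overlap mechanic on its own; the function name and parameters are mine, not FLIN's:

```rust
/// Simplified overlapping chunker: when a chunk fills up, its trailing
/// `overlap` words seed the next chunk so context survives the boundary.
/// Assumes overlap < max_words. Words stand in for real tokens.
fn chunk_words(text: &str, max_words: usize, overlap: usize) -> Vec<Vec<String>> {
    let words: Vec<String> = text.split_whitespace().map(String::from).collect();
    let mut chunks = Vec::new();
    let mut current: Vec<String> = Vec::new();

    for word in words {
        if current.len() == max_words {
            chunks.push(current.clone());
            // Carry the last `overlap` words into the next chunk.
            current = current[current.len() - overlap..].to_vec();
        }
        current.push(word);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}

fn main() {
    let chunks = chunk_words("a b c d e f g h", 4, 2);
    // Chunks: [a b c d], [c d e f], [e f g h] -- each pair shares 2 words.
    assert_eq!(chunks.len(), 3);
    assert_eq!(chunks[0].join(" "), "a b c d");
    assert_eq!(chunks[1].join(" "), "c d e f");
    assert_eq!(chunks[2].join(" "), "e f g h");
}
```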
Session 222 integrated chunking with embedding generation. When a document is uploaded to a semantic file field, FLIN automatically extracts text, chunks it, generates embeddings for each chunk using FastEmbed (the local embedding model built in Session 172), and indexes the chunks for search.
```
// Semantic file search in FLIN (Session 223)
entity Document {
  title: text
  content: semantic file   // Auto-parsed, auto-chunked, auto-embedded
  uploaded: time = now
}

// Upload a PDF -- extraction happens automatically
save Document { title: "FLIN Specification", content: uploaded_file }

// Search across all document contents
results = search "temporal database queries" in Document by content

for result in results {
  print(result.entity.title)
  print(result.score)
}
```
Session 223 enabled semantic file search. Session 224 added hybrid search (combining keyword BM25 and semantic vector search). Session 225 added search analytics. Session 226 implemented result caching. Session 227 built the chunk-to-file mapping that lets search results point back to the original document and specific page numbers.
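One common way to combine keyword and semantic rankings, sketched here, is to min-max normalize each score list and take a weighted sum. This is an illustration of the general hybrid-search idea, not a claim about FLIN's actual fusion formula; the 0.5 weight is arbitrary.

```rust
/// Min-max normalize scores into [0, 1].
fn normalize(scores: &[f64]) -> Vec<f64> {
    let min = scores.iter().cloned().fold(f64::INFINITY, f64::min);
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    scores.iter()
        .map(|s| if max > min { (s - min) / (max - min) } else { 1.0 })
        .collect()
}

/// Weighted fusion of BM25 (keyword) and vector (semantic) scores.
/// `alpha` balances the two signals; 0.5 weights them equally.
fn hybrid_scores(bm25: &[f64], semantic: &[f64], alpha: f64) -> Vec<f64> {
    let b = normalize(bm25);
    let s = normalize(semantic);
    b.iter().zip(&s).map(|(b, s)| alpha * b + (1.0 - alpha) * s).collect()
}

fn main() {
    // Doc 0 wins on keywords, doc 1 on semantics, doc 2 is middling.
    let bm25 = [12.0, 3.0, 8.0];
    let semantic = [0.2, 0.9, 0.5];
    let combined = hybrid_scores(&bm25, &semantic, 0.5);
    // Equal weighting makes docs 0 and 1 tie; doc 2 trails both.
    assert!((combined[0] - 0.5).abs() < 1e-9);
    assert!((combined[1] - 0.5).abs() < 1e-9);
    assert!(combined[2] < combined[0]);
}
```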
Sessions 228-235: Format Support and Infrastructure
Sessions 228-231 expanded document parsing to cover every common format:
| Session | Formats Added | Parser |
|---|---|---|
| 228 | CSV, XLSX | Tabular data extraction |
| 229 | JSON, YAML | Structured data extraction |
| 230 | RTF | Rich text extraction |
| 231 | XML with XPath | Structured markup extraction |
Session 232 added automatic semantic conversion -- when a plain file field is changed to semantic file, existing files are automatically re-processed. Session 233 implemented zstd compression for stored files, reducing storage costs. Session 234 built blob garbage collection -- a background process that identifies and removes orphaned file blobs that are no longer referenced by any entity.
```rust
// Garbage collection for file blobs (Session 234)
pub fn gc_orphaned_blobs(
    storage: &dyn StorageBackend,
    db: &Database,
) -> Result<GcReport, GcError> {
    let all_blobs = storage.list_keys()?;
    let referenced = db.all_file_references()?;

    let orphaned: Vec<_> = all_blobs
        .iter()
        .filter(|key| !referenced.contains(key))
        .collect();

    let mut deleted = 0;
    let mut freed_bytes = 0;

    for key in &orphaned {
        if let Ok(meta) = storage.metadata(key) {
            freed_bytes += meta.size;
        }
        storage.delete(key)?;
        deleted += 1;
    }

    Ok(GcReport { deleted, freed_bytes, total_checked: all_blobs.len() })
}
```
Session 235 built preview generation for common file types -- thumbnails for images, first-page renders for PDFs. Session 236 integrated previews with the HTTP server, making them accessible via standard URLs.
Sessions 237-243: RAG and Completion
Sessions 237-238 handled tracking synchronization and integration testing. Session 239 added code-aware chunking -- a specialized chunker that understands programming language syntax and splits code files at function and class boundaries rather than arbitrary token counts.
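The idea behind code-aware chunking can be shown with a deliberately naive sketch: start a new chunk at each top-level function definition instead of at a fixed token count. A real implementation (presumably including FLIN's) would use the language's parser rather than string matching; this version only recognizes Rust-style `fn ` lines.

```rust
/// Naive code-aware chunker: splits source at top-level `fn ` boundaries
/// so each chunk holds a whole function. Illustrative only -- a production
/// chunker would parse the syntax tree instead of matching prefixes.
fn chunk_code(source: &str) -> Vec<String> {
    let mut chunks: Vec<String> = Vec::new();
    for line in source.lines() {
        // Open a new chunk at each function definition (or for a preamble).
        if line.starts_with("fn ") || chunks.is_empty() {
            chunks.push(String::new());
        }
        let chunk = chunks.last_mut().unwrap();
        chunk.push_str(line);
        chunk.push('\n');
    }
    chunks
}

fn main() {
    let src = "fn a() {\n    1\n}\nfn b() {\n    2\n}\n";
    let chunks = chunk_code(src);
    assert_eq!(chunks.len(), 2);
    assert!(chunks[0].starts_with("fn a"));
    assert!(chunks[1].starts_with("fn b"));
}
```

Splitting at function boundaries keeps each chunk semantically whole, which matters for embeddings: half a function embeds poorly.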
Sessions 240-242 built the RAG (Retrieval-Augmented Generation) pipeline:
```
// RAG with the "ask about" syntax (Session 240)
entity Contract {
  title: text
  document: semantic file
  client: text
}

// Natural language queries over document contents
answer = ask about Contract "What are the payment terms for the Acme contract?"

// The pipeline:
// 1. Embed the question
// 2. Search document chunks for relevant passages
// 3. Retrieve top-K chunks with reranking
// 4. Send question + context to LLM
// 5. Return attributed answer with source references
```
Session 241 implemented top-K retrieval with cross-encoder reranking -- a two-stage search where an initial fast search retrieves candidates, then a more precise model reranks them for relevance. Session 242 added source attribution, so every AI-generated answer includes references to the specific document chunks it drew from.
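The two-stage shape can be sketched with stand-in scoring functions: a cheap score ranks everything and keeps N candidates, then an expensive score (the cross-encoder's role) reranks only those survivors. Everything here -- names, parameters, the toy scorers -- is illustrative, not FLIN's actual retrieval code.

```rust
/// Two-stage retrieval: `cheap` plays the fast first-stage model,
/// `precise` plays the cross-encoder reranker applied only to the
/// `n_candidates` survivors. Returns the final top-k documents.
fn rerank_top_k(
    docs: &[&str],
    cheap: impl Fn(&str) -> f64,
    precise: impl Fn(&str) -> f64,
    n_candidates: usize,
    k: usize,
) -> Vec<String> {
    // Stage 1: rank everything with the cheap score, keep n_candidates.
    let mut candidates: Vec<&str> = docs.to_vec();
    candidates.sort_by(|a, b| cheap(b).partial_cmp(&cheap(a)).unwrap());
    candidates.truncate(n_candidates);

    // Stage 2: rerank only the survivors with the precise score.
    candidates.sort_by(|a, b| precise(b).partial_cmp(&precise(a)).unwrap());
    candidates.truncate(k);
    candidates.into_iter().map(String::from).collect()
}

fn main() {
    let docs = ["aaa", "ab", "bbb", "abab"];
    // Toy scorers: stage 1 counts 'a's, stage 2 counts 'b's.
    let top = rerank_top_k(
        &docs,
        |d| d.matches('a').count() as f64,
        |d| d.matches('b').count() as f64,
        3, // "bbb" is cut in stage 1 despite its perfect stage-2 score
        1,
    );
    assert_eq!(top, vec!["abab".to_string()]);
}
```

The point of the split is cost: the precise model is too slow to run over the whole corpus, so it only ever sees the candidate set.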
Session 243 -- the final session of the marathon -- completed score ranking for search results. The search keyword now returns structured results with both the entity and a normalized relevance score:
```
results = search "quarterly revenue" in Report by content

for result in results {
  print("${result.entity.title}: score ${result.score}")
  // "Q3 Financial Report: score 0.92"
  // "Annual Summary: score 0.67"
}
```
With Session 243, File Management reached 75/75 tasks. One hundred percent complete.
The Final Tally
```
File Management: 75/75 (100%)

FM-1: File Upload HTTP        12/12   Multipart, validators, save_file
FM-2: File Field Type          8/8    File type, properties, semantic
FM-3: Storage Backends        16/16   Local, S3, R2, GCS, grants
FM-4: Document Parsing        13/13   PDF, DOCX, CSV, JSON, YAML, RTF, XML
FM-5: Chunking & RAG          10/10   Chunking, embedding, RAG pipeline
FM-6: Semantic File Search     8/8    Hybrid search, analytics, scoring
FM-7: Compression & GC         8/8    Zstd, preview, garbage collection

Sessions: 212-243 (30 sessions)
Days: 3 (January 20-22)
Tests added: ~700 (reaching 3,620 total)
```
The test count growth during the file storage marathon was dramatic -- from approximately 2,929 tests at the start to 3,620 tests by Session 243. Nearly 700 new tests in 30 sessions, covering multipart parsing, storage backends, document extraction, chunking algorithms, search ranking, and garbage collection.
What 30 Sessions for File Storage Means
File management is often treated as infrastructure -- boring but necessary plumbing that sits below the "real" features. The file storage marathon demonstrates that this plumbing, when done well, becomes a differentiating feature.
Most web frameworks give you file upload and leave the rest to you. You find a storage library, a document parsing library, a search library, and an embedding library. You glue them together with configuration files and hope they interoperate. When they do not, you debug the integration layer that no one designed and no one owns.
FLIN gives you semantic file as a field type. Upload a PDF, and it is automatically parsed, chunked, embedded, and indexed. Search it with search "query" in Entity by content. Ask questions about it with ask about Entity "question". The entire pipeline -- from multipart HTTP upload to AI-powered question answering -- is built into the runtime.
Thirty sessions. Three days. A file management system that most organizations would spec as a quarter-long project. The CEO + AI CTO model makes this possible not because the code is simpler, but because the feedback loop is faster. Each session produces working, tested code. The next session builds on it immediately. There is no waiting for code review, no context-switching between projects, no meetings to align on API design. The momentum is continuous.
---
This is Part 201 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO built a programming language from scratch.
Series Navigation: - [200] The Security Sprint: 18 Sessions - [201] The File Storage Marathon: 30 Sessions (you are here) - [202] The Admin Console From Scratch