A system that does not crash (Phase 1) and maintains consistent state (Phase 2) still has one critical requirement for production: it must be fast enough. Not fast in benchmarks -- fast in the context that matters. A web application that takes 800 milliseconds to render a page is functionally broken, even if it never crashes and never loses data.
Session 246 focused on the performance dimension of production readiness. We profiled FLIN under realistic workloads, identified bottlenecks, and systematically eliminated them. The work fell into three categories: memory optimization, compilation speed, and runtime execution.
## Profiling Methodology
Before optimizing anything, we needed data. We constructed a benchmark application that exercises every major FLIN subsystem: entity CRUD, template rendering, search (BM25 and semantic), cron jobs, file uploads, and WebSocket connections. The application was seeded with 10,000 entities across 5 entity types, with full-text search indexes and foreign key relationships.
We used three profiling tools:
- `perf` for CPU profiling -- identifying which functions consume the most execution time.
- `heaptrack` for memory profiling -- tracking every allocation and identifying leaks.
- Rust's built-in `std::time::Instant` for end-to-end latency measurement on specific code paths.
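As a concrete sketch of the `Instant`-based approach (the `measure` helper here is illustrative, not FLIN's actual harness):

```rust
use std::time::Instant;

// Hypothetical helper: runs a closure and returns its result together
// with the elapsed wall-clock time in microseconds.
fn measure<T>(f: impl FnOnce() -> T) -> (T, u128) {
    let start = Instant::now();
    let result = f();
    (result, start.elapsed().as_micros())
}

fn main() {
    let (sum, micros) = measure(|| (0u64..1_000_000).sum::<u64>());
    println!("sum = {}, took {} us", sum, micros);
}
```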
The baseline numbers before any optimization:
| Operation | P50 Latency | P99 Latency | Memory |
|---|---|---|---|
| Simple page render | 12 ms | 45 ms | 2.1 MB |
| Entity list (100 items) | 18 ms | 72 ms | 4.8 MB |
| Full-text search | 25 ms | 110 ms | 6.2 MB |
| Hybrid search | 38 ms | 180 ms | 9.4 MB |
| Compilation (50 files) | 340 ms | -- | 48 MB |
These numbers are acceptable for a development environment but unacceptable for production. A P99 of 180 milliseconds for hybrid search means that one in a hundred requests takes nearly a fifth of a second just for the search -- before template rendering, before HTTP serialization, before network transfer.
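For reference, P50 and P99 are order statistics over a latency sample. A minimal nearest-rank sketch (the method FLIN's harness actually uses may differ):

```rust
// Nearest-rank percentile over a latency sample, in milliseconds.
// A minimal sketch; other interpolation schemes give slightly different values.
fn percentile(samples: &mut Vec<u64>, p: f64) -> u64 {
    samples.sort_unstable();
    // Nearest-rank: ceil(p/100 * N), converted to a zero-based index.
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1)]
}

fn main() {
    let samples: Vec<u64> = (1..=100).collect(); // 1 ms .. 100 ms
    let p50 = percentile(&mut samples.clone(), 50.0);
    let p99 = percentile(&mut samples.clone(), 99.0);
    println!("P50 = {} ms, P99 = {} ms", p50, p99);
}
```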
## Memory Optimization: String Interning
The profiling data revealed that string allocation dominated memory usage. FLIN entity field names, entity type names, and commonly used string values were being allocated as separate String instances on every access. In a page that renders 100 entities, the string "User" was allocated 100 times, the string "name" was allocated 100 times, the string "email" was allocated 100 times.
String interning eliminates this redundancy by storing each unique string once and using lightweight references everywhere else:
```rust
use std::collections::HashSet;
use std::sync::Arc;

pub struct StringInterner {
    strings: HashSet<Arc<str>>,
}

impl StringInterner {
    pub fn new() -> Self {
        Self { strings: HashSet::new() }
    }

    pub fn intern(&mut self, s: &str) -> Arc<str> {
        // Return the existing allocation if we have seen this string before.
        if let Some(existing) = self.strings.get(s) {
            return Arc::clone(existing);
        }
        // First occurrence: allocate once and remember it.
        let arc: Arc<str> = Arc::from(s);
        self.strings.insert(Arc::clone(&arc));
        arc
    }
}
```
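A condensed, free-function version of the same idea shows the payoff: interning the same string twice yields two handles to one allocation (names here are illustrative):

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Minimal interning sketch: a HashSet<Arc<str>> acts as the string pool.
fn intern(pool: &mut HashSet<Arc<str>>, s: &str) -> Arc<str> {
    if let Some(existing) = pool.get(s) {
        return Arc::clone(existing);
    }
    let arc: Arc<str> = Arc::from(s);
    pool.insert(Arc::clone(&arc));
    arc
}

fn main() {
    let mut pool = HashSet::new();
    let a = intern(&mut pool, "User");
    let b = intern(&mut pool, "User");
    // Two handles, one allocation: no duplicate "User" strings in memory.
    assert!(Arc::ptr_eq(&a, &b));
    assert_eq!(pool.len(), 1);
}
```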
We interned entity type names, field names, decorator names, and built-in string values ("none", "true", "false", common error messages). The result was dramatic:
| Scenario | Before | After | Reduction |
|---|---|---|---|
| 100 User entities loaded | 4.8 MB | 1.9 MB | 60% |
| 1,000 entities (mixed types) | 38 MB | 14 MB | 63% |
| Template with 50 variables | 2.1 MB | 0.9 MB | 57% |
String interning reduced memory usage by approximately 60% for entity-heavy workloads. This is one of those optimizations that costs very little in implementation complexity but pays enormous dividends.
## Memory Optimization: Value Representation
FLIN's Value enum -- the runtime representation of every value in the language -- was 72 bytes. For integers, floats, booleans, and none values, this was wasteful. A 64-bit integer does not need 72 bytes of storage.
The original representation:
```rust
// Before: 72 bytes per Value
pub enum Value {
    None,
    Bool(bool),
    Int(i64),
    Float(f64),
    Text(String),                // 24 bytes (ptr + len + capacity)
    List(Vec<Value>),            // 24 bytes
    Map(HashMap<String, Value>), // 48 bytes
    Entity(Box<Entity>),         // 8 bytes (pointer)
}
```

Rust enums are sized to their largest variant. The Map variant with its HashMap forced every Value to be 72 bytes, even a simple Value::Int(42).
We restructured the representation using boxing for large variants:
```rust
// After: 24 bytes per Value
pub enum Value {
    None,
    Bool(bool),
    Int(i64),
    Float(f64),
    Text(Arc<str>),                     // 8 bytes (pointer)
    List(Arc<Vec<Value>>),              // 8 bytes (pointer)
    Map(Arc<HashMap<Arc<str>, Value>>), // 8 bytes (pointer)
    Entity(Arc<Entity>),                // 8 bytes (pointer)
}
```

By wrapping the large variants in Arc (an atomically reference-counted pointer), every Value is now 24 bytes -- the largest inline payload (an i64, f64, or pointer at 8 bytes) plus the discriminant and padding. The Arc also enables cheap cloning: copying a Value::List with 10,000 elements is a pointer copy and an atomic increment, not a deep copy.
This change reduced the memory footprint of the VM's value stack by 67% and made value passing between functions essentially free.
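The cheap-clone property is easy to demonstrate with a trimmed-down stand-in for the Value enum (variants reduced for the sketch):

```rust
use std::sync::Arc;

// Simplified stand-in for FLIN's Value enum, with most variants omitted.
#[derive(Clone)]
enum Value {
    Int(i64),
    List(Arc<Vec<Value>>),
}

fn main() {
    let items: Arc<Vec<Value>> = Arc::new((0i64..10_000).map(Value::Int).collect());
    let big = Value::List(Arc::clone(&items));

    // Cloning copies a pointer and bumps a refcount; the 10,000 elements
    // are shared, not duplicated.
    let copy = big.clone();
    assert_eq!(Arc::strong_count(&items), 3); // items + big + copy

    drop(copy);
    assert_eq!(Arc::strong_count(&items), 2);
}
```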
## Compilation Speed: Parallel Module Compilation
FLIN applications are organized into files: `app/index.flin`, `app/api/users.flin`, `app/admin/dashboard.flin`, and so on. Before Phase 3, these files were compiled sequentially -- each file was lexed, parsed, type-checked, and code-generated before the next file started.
For a 50-file application, this meant the compilation pipeline was serialized across 50 files, even though most files are independent. A user page does not depend on an admin page. An API endpoint does not depend on a template component.
We introduced parallel module compilation using Rust's rayon crate:
```rust
use rayon::prelude::*;

pub fn compile_project(project_dir: &Path) -> Result<Bytecode, CompileError> {
    // Phase 1: Discover all source files in the project
    let files = discover_source_files(project_dir)?;

    // Phase 2: Lex and parse all files in parallel
    let parsed_modules: Vec<ParsedModule> = files
        .par_iter()
        .map(|path| parse_file(path))
        .collect::<Result<_, _>>()?;

    // Phase 3: Build the global type environment (sequential)
    let mut type_env = TypeEnvironment::new();
    for module in &parsed_modules {
        type_env.register_module_exports(&module.ast)?;
    }

    // Phase 4: Type-check all modules in parallel
    let checked_modules: Vec<CheckedModule> = parsed_modules
        .par_iter()
        .map(|module| type_check(module, &type_env))
        .collect::<Result<_, _>>()?;

    // Phase 5: Code generation (sequential -- writes to shared bytecode buffer)
    let mut codegen = CodeGenerator::new();
    for module in &checked_modules {
        codegen.generate_module(module)?;
    }

    Ok(codegen.finalize())
}
```
Phases 2 and 4 (parsing and type-checking) run in parallel across all CPU cores. Phase 3 (building the global type environment) must be sequential because it aggregates type information from all modules. Phase 5 (code generation) is also sequential because it writes to a shared bytecode buffer with offsets that depend on the previous module's output.
On an 8-core machine, this reduced compilation time for a 50-file project from 340 milliseconds to 95 milliseconds -- a 3.6x improvement. The speedup is not 8x because phases 3 and 5 are sequential, following Amdahl's law.
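The Amdahl arithmetic checks out. A small sketch (the parallel fraction of 0.825 below is derived by solving for the observed 3.6x, not a measured value from the article):

```rust
// Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the parallel
// fraction of the work and n the number of cores.
fn amdahl_speedup(p: f64, n: f64) -> f64 {
    1.0 / ((1.0 - p) + p / n)
}

fn main() {
    // Solving 3.6 = 1 / ((1 - p) + p / 8) gives p ~= 0.825: roughly 83% of
    // the pipeline (parsing + type-checking) parallelizes, the rest is the
    // sequential type-environment and codegen phases.
    let p = 0.825;
    println!("predicted speedup on 8 cores: {:.2}x", amdahl_speedup(p, 8.0));
    // Even with infinite cores, the sequential phases cap the speedup.
    println!("upper bound: {:.2}x", amdahl_speedup(p, 1e12));
}
```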
## Runtime Optimization: Bytecode Dispatch
The FLIN VM executes bytecode instructions in a dispatch loop. The original implementation used a match statement:
```rust
// Before: match-based dispatch
loop {
    let instruction = self.bytecode[self.ip];
    self.ip += 1;

    match instruction {
        Op::LoadConst(idx) => { /* ... */ }
        Op::Add => { /* ... */ }
        Op::Call(func_id) => { /* ... */ }
        // ... 120+ opcodes
    }
}
```
A match with 120+ arms compiles to a jump table, which is reasonably fast. But the CPU's branch predictor struggles with indirect jumps through a jump table because the target depends on the runtime value of the instruction.
We switched to a function pointer dispatch table -- the closest safe-Rust analogue of the computed-goto technique used by CPython, LuaJIT, and other high-performance interpreters:
```rust
type OpcodeHandler = fn(&mut VM) -> Result<(), RuntimeError>;

static DISPATCH_TABLE: [OpcodeHandler; 128] = [
    VM::op_load_const,  // 0
    VM::op_load_local,  // 1
    VM::op_store_local, // 2
    VM::op_add,         // 3
    VM::op_subtract,    // 4
    VM::op_multiply,    // 5
    VM::op_divide,      // 6
    VM::op_call,        // 7
    // ... 120 more handlers
];

impl VM {
    pub fn execute(&mut self) -> Result<Value, RuntimeError> {
        loop {
            let opcode = self.bytecode[self.ip] as usize;
            self.ip += 1;

            if opcode >= DISPATCH_TABLE.len() {
                return Err(RuntimeError::new(
                    "InvalidOpcode",
                    &format!("Unknown opcode: {}", opcode),
                    self.current_span(),
                ));
            }

            DISPATCH_TABLE[opcode](self)?;

            if self.should_return {
                break;
            }
        }

        self.pop()
    }
}
```
The function pointer table approach has two advantages: each handler is a separate function that the CPU can predict independently, and the table lookup is a single indexed memory access. In our benchmarks, this improved bytecode execution speed by 15-20% across all workloads.
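A toy interpreter in the same style shows the pattern end to end (the opcodes and handlers here are invented for illustration):

```rust
// Toy dispatch-table interpreter: each opcode indexes into a table of
// function pointers. Opcodes: 0 = push constant 1, 1 = add top two values.
struct Vm {
    stack: Vec<i64>,
    code: Vec<u8>,
    ip: usize,
}

fn op_push1(vm: &mut Vm) {
    vm.stack.push(1);
}

fn op_add(vm: &mut Vm) {
    let b = vm.stack.pop().unwrap();
    let a = vm.stack.pop().unwrap();
    vm.stack.push(a + b);
}

static TABLE: [fn(&mut Vm); 2] = [op_push1, op_add];

fn run(vm: &mut Vm) -> i64 {
    while vm.ip < vm.code.len() {
        let opcode = vm.code[vm.ip] as usize;
        vm.ip += 1;
        TABLE[opcode](vm); // single indexed load + indirect call
    }
    vm.stack.pop().unwrap()
}

fn main() {
    // push1, push1, add
    let mut vm = Vm { stack: vec![], code: vec![0, 0, 1], ip: 0 };
    assert_eq!(run(&mut vm), 2);
}
```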
## Runtime Optimization: Entity Query Compilation
FLIN's query builder (`Entity.where(field == value).order(field).limit(n)`) was previously interpreted at runtime -- each execution walked the AST of the filter expression and built a query plan dynamically. For queries that execute repeatedly (in loops, in templates, in API endpoints), this interpretation overhead was significant.
We added query compilation: the first time a query pattern is encountered, it is compiled into an optimized scan function. Subsequent executions use the compiled function directly.
```rust
pub struct CompiledQuery {
    filter: Box<dyn Fn(&Entity) -> bool + Send + Sync>,
    sort_key: Option<Box<dyn Fn(&Entity) -> Value + Send + Sync>>,
    sort_order: SortOrder,
    limit: Option<usize>,
    offset: usize,
}

impl CompiledQuery {
    pub fn execute(&self, entities: &[Entity]) -> Vec<&Entity> {
        let mut results: Vec<&Entity> = entities
            .iter()
            .filter(|e| (self.filter)(e))
            .collect();

        if let Some(ref sort_key) = self.sort_key {
            results.sort_by(|a, b| {
                let ka = sort_key(a);
                let kb = sort_key(b);
                match self.sort_order {
                    SortOrder::Asc => ka.cmp(&kb),
                    SortOrder::Desc => kb.cmp(&ka),
                }
            });
        }

        results
            .into_iter()
            .skip(self.offset)
            .take(self.limit.unwrap_or(usize::MAX))
            .collect()
    }
}
```
Query compilation reduced the latency of entity list operations by 40% -- from 18 ms P50 to 11 ms P50 for 100-item queries against a 10,000-entity table.
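A usage sketch of the compiled-query pattern, with a simplified record type standing in for Entity (all names here are illustrative, not FLIN's actual API):

```rust
// Simplified record type standing in for FLIN's Entity.
struct User {
    age: i64,
}

// Trimmed-down compiled query: the filter closure is built once and
// reused on every execution, with no per-call AST walking.
struct CompiledQuery {
    filter: Box<dyn Fn(&User) -> bool>,
    limit: Option<usize>,
}

impl CompiledQuery {
    fn execute<'a>(&self, users: &'a [User]) -> Vec<&'a User> {
        users
            .iter()
            .filter(|u| (self.filter)(u))
            .take(self.limit.unwrap_or(usize::MAX))
            .collect()
    }
}

fn main() {
    // Compiled once, executed many times.
    let q = CompiledQuery {
        filter: Box::new(|u| u.age >= 18),
        limit: Some(2),
    };
    let users = vec![User { age: 15 }, User { age: 21 }, User { age: 34 }, User { age: 40 }];
    let adults = q.execute(&users);
    assert_eq!(adults.len(), 2);
}
```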
## The Results
After all Phase 3 optimizations, the benchmark numbers:
| Operation | Before P50 | After P50 | Improvement |
|---|---|---|---|
| Simple page render | 12 ms | 4 ms | 3x faster |
| Entity list (100 items) | 18 ms | 7 ms | 2.6x faster |
| Full-text search | 25 ms | 9 ms | 2.8x faster |
| Hybrid search | 38 ms | 14 ms | 2.7x faster |
| Compilation (50 files) | 340 ms | 95 ms | 3.6x faster |
| Metric | Before | After | Improvement |
|---|---|---|---|
| Value size | 72 bytes | 24 bytes | 67% smaller |
| Entity memory (100 items) | 4.8 MB | 1.9 MB | 60% less |
| Idle memory (server running) | 28 MB | 12 MB | 57% less |
Every single operation in the benchmark suite improved by at least 2.5x. Memory usage dropped by more than half. And none of these optimizations changed the semantics of the language or the behavior of the runtime -- they are purely internal improvements, invisible to the FLIN developer.
## The Three Phases Complete
Production hardening across Sessions 244, 245, and 246 transformed FLIN from a working prototype into a system suitable for production deployment:
- Phase 1 (Stability): The system does not crash. Errors are caught, logged, and returned as meaningful responses. 35 of 47 crash vectors eliminated.
- Phase 2 (Reliability): The system maintains consistent state. Operations are atomic, the WAL recovers gracefully, and foreign keys are enforced across transactions.
- Phase 3 (Performance): The system is fast. Memory usage halved, compilation 3.6x faster, runtime operations 2.5-3x faster.
These three properties -- stability, reliability, performance -- are the non-negotiable requirements for production software. Features attract users. These properties keep them.
---
This is Part 183 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.
Series Navigation:
- [182] Production Hardening Phase 2: Reliability
- [183] Production Hardening Phase 3: Performance (you are here)
- [184] MVP Status Review: What's Ready and What's Not