
Production Hardening Phase 3

Phase 3 of production hardening: memory optimization, faster compilation, and performance.

Thales & Claude | March 25, 2026 | 10 min read

A system that does not crash (Phase 1) and maintains consistent state (Phase 2) still has one critical requirement for production: it must be fast enough. Not fast in benchmarks -- fast in the context that matters. A web application that takes 800 milliseconds to render a page is functionally broken, even if it never crashes and never loses data.

Session 246 focused on the performance dimension of production readiness. We profiled FLIN under realistic workloads, identified bottlenecks, and systematically eliminated them. The work fell into three categories: memory optimization, compilation speed, and runtime execution.

Profiling Methodology

Before optimizing anything, we needed data. We constructed a benchmark application that exercises every major FLIN subsystem: entity CRUD, template rendering, search (BM25 and semantic), cron jobs, file uploads, and WebSocket connections. The application was seeded with 10,000 entities across 5 entity types, with full-text search indexes and foreign key relationships.

We used three profiling tools:

  • perf for CPU profiling -- identifying which functions consume the most execution time.
  • heaptrack for memory profiling -- tracking every allocation and identifying leaks.
  • Rust's built-in std::time::Instant for end-to-end latency measurement on specific code paths.
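For the Instant-based measurements, a small percentile harness is all that is needed. The sketch below is illustrative only -- the helper names are ours, not FLIN's actual harness:

```rust
use std::time::Instant;

// Run a closure repeatedly, collect per-iteration latencies in nanoseconds,
// and report the P50 and P99 samples from the sorted list.
fn measure<F: FnMut()>(mut f: F, iterations: usize) -> (u128, u128) {
    let mut samples: Vec<u128> = (0..iterations)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed().as_nanos()
        })
        .collect();
    samples.sort_unstable();
    // P50 and P99 by index into the sorted samples.
    (samples[iterations / 2], samples[iterations * 99 / 100])
}

fn main() {
    let (p50, p99) = measure(|| { let _ = (0..1000u64).sum::<u64>(); }, 1000);
    assert!(p50 <= p99);
    println!("p50 = {} ns, p99 = {} ns", p50, p99);
}
```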

The baseline numbers before any optimization:

| Operation | P50 Latency | P99 Latency | Memory |
|---|---|---|---|
| Simple page render | 12 ms | 45 ms | 2.1 MB |
| Entity list (100 items) | 18 ms | 72 ms | 4.8 MB |
| Full-text search | 25 ms | 110 ms | 6.2 MB |
| Hybrid search | 38 ms | 180 ms | 9.4 MB |
| Compilation (50 files) | 340 ms | -- | 48 MB |

These numbers are acceptable for a development environment but unacceptable for production. A P99 of 180 milliseconds for hybrid search means that one in a hundred requests takes nearly a fifth of a second just for the search -- before template rendering, before HTTP serialization, before network transfer.

Memory Optimization: String Interning

The profiling data revealed that string allocation dominated memory usage. FLIN entity field names, entity type names, and commonly used string values were being allocated as separate String instances on every access. In a page that renders 100 entities, the string "User" was allocated 100 times, the string "name" was allocated 100 times, the string "email" was allocated 100 times.

String interning eliminates this redundancy by storing each unique string once and using lightweight references everywhere else:

```rust
use std::collections::HashSet;
use std::sync::Arc;

pub struct StringInterner {
    strings: HashSet<Arc<str>>,
}

impl StringInterner {
    pub fn new() -> Self {
        Self { strings: HashSet::new() }
    }

    pub fn intern(&mut self, s: &str) -> Arc<str> {
        if let Some(existing) = self.strings.get(s) {
            Arc::clone(existing)
        } else {
            let interned: Arc<str> = Arc::from(s);
            self.strings.insert(Arc::clone(&interned));
            interned
        }
    }
}
```

We interned entity type names, field names, decorator names, and built-in string values ("none", "true", "false", common error messages). The result was dramatic:

| Scenario | Before | After | Reduction |
|---|---|---|---|
| 100 User entities loaded | 4.8 MB | 1.9 MB | 60% |
| 1,000 entities (mixed types) | 38 MB | 14 MB | 63% |
| Template with 50 variables | 2.1 MB | 0.9 MB | 57% |

String interning reduced memory usage by approximately 60% for entity-heavy workloads. This is one of those optimizations that costs very little in implementation complexity but pays enormous dividends.
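To see the sharing in action, here is a standalone sketch of the same interning pattern (a free function rather than FLIN's StringInterner, for brevity): interning the same string twice yields two handles to a single allocation.

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Same pattern as the interner above, as a free function over a HashSet pool.
fn intern(strings: &mut HashSet<Arc<str>>, s: &str) -> Arc<str> {
    if let Some(existing) = strings.get(s) {
        Arc::clone(existing)
    } else {
        let interned: Arc<str> = Arc::from(s);
        strings.insert(Arc::clone(&interned));
        interned
    }
}

fn main() {
    let mut pool = HashSet::new();
    let a = intern(&mut pool, "name");
    let b = intern(&mut pool, "name");
    // Both handles point at the same allocation: no duplicate "name" in memory.
    assert!(Arc::ptr_eq(&a, &b));
    // The pool holds exactly one entry despite two intern calls.
    assert_eq!(pool.len(), 1);
    println!("interned ok");
}
```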

Memory Optimization: Value Representation

FLIN's Value enum -- the runtime representation of every value in the language -- was 72 bytes. For integers, floats, booleans, and none values, this was wasteful. A 64-bit integer does not need 72 bytes of storage.

The original representation:

```rust
// Before: 72 bytes per Value
pub enum Value {
    None,
    Bool(bool),
    Int(i64),
    Float(f64),
    Text(String),           // 24 bytes (ptr + len + capacity)
    List(Vec<Value>),       // 24 bytes
    Map(HashMap<String, Value>), // 48 bytes
    Entity(Box<Entity>),    // 8 bytes (pointer)
}
```

Rust enums are sized to their largest variant. The Map variant with its HashMap forced every Value to be 72 bytes, even a simple Value::Int(42).

We restructured the representation using boxing for large variants:

```rust
// After: 24 bytes per Value
pub enum Value {
    None,
    Bool(bool),
    Int(i64),
    Float(f64),
    Text(Arc<str>),              // 16 bytes (fat pointer: ptr + len)
    List(Arc<Vec<Value>>),       // 8 bytes (pointer)
    Map(Arc<HashMap<Arc<str>, Value>>),  // 8 bytes (pointer)
    Entity(Arc<Entity>),         // 8 bytes (pointer)
}
```

By wrapping the large variants in Arc (an atomically reference-counted pointer), every Value is now 24 bytes: the largest payload is the 16-byte Arc<str> fat pointer (pointer plus length), and the discriminant and padding add 8 more. The Arc also enables cheap cloning: copying a Value::List with 10,000 elements is a pointer copy and an atomic increment, not a deep copy.

This change reduced the memory footprint of the VM's value stack by 67% and made value passing between functions essentially free.
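The 24-byte claim is easy to check with std::mem::size_of. The sketch below uses simplified stand-ins for FLIN's actual types (a unit Entity struct), and the exact layout assumes a 64-bit target:

```rust
use std::collections::HashMap;
use std::mem::size_of;
use std::sync::Arc;

// Stand-in for FLIN's Entity type, for illustration only.
#[allow(dead_code)]
struct Entity;

#[allow(dead_code)]
enum Value {
    None,
    Bool(bool),
    Int(i64),
    Float(f64),
    Text(Arc<str>),
    List(Arc<Vec<Value>>),
    Map(Arc<HashMap<Arc<str>, Value>>),
    Entity(Arc<Entity>),
}

fn main() {
    // The 16-byte Arc<str> fat pointer is the largest payload, so on a
    // 64-bit target the whole enum fits in 24 bytes (payload + tag + padding).
    assert_eq!(size_of::<Value>(), 24);
    println!("Value is {} bytes", size_of::<Value>());
}
```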

Compilation Speed: Parallel Module Compilation

FLIN applications are organized into files: app/index.flin, app/api/users.flin, app/admin/dashboard.flin, and so on. Before Phase 3, these files were compiled sequentially -- each file was lexed, parsed, type-checked, and code-generated before the next file started.

For a 50-file application, this meant the compilation pipeline was serialized across 50 files, even though most files are independent. A user page does not depend on an admin page. An API endpoint does not depend on a template component.

We introduced parallel module compilation using Rust's rayon crate:

use rayon::prelude::*;

pub fn compile_project(project_dir: &Path) -> Result { // Phase 1: Discover all .flin files let source_files = discover_flin_files(project_dir)?;

// Phase 2: Lex and parse all files in parallel let parsed_modules: Vec = source_files .par_iter() .map(|file| { let source = std::fs::read_to_string(file)?; let tokens = lex(&source, file)?; let ast = parse(&tokens, file)?; Ok(ParsedModule { file: file.clone(), ast }) }) .collect::, CompileError>>()?;

// Phase 3: Build the global type environment (sequential) let mut type_env = TypeEnvironment::new(); for module in &parsed_modules { type_env.register_module_exports(&module.ast)?; }

// Phase 4: Type-check all modules in parallel let checked_modules: Vec = parsed_modules .par_iter() .map(|module| { let checker = TypeChecker::new(&type_env); checker.check_module(&module.ast) }) .collect::, CompileError>>()?;

// Phase 5: Code generation (sequential -- writes to shared bytecode buffer) let mut codegen = CodeGenerator::new(); for module in &checked_modules { codegen.generate_module(module)?; }

Ok(codegen.finalize()) } ```

Phases 2 and 4 (parsing and type-checking) run in parallel across all CPU cores. Phase 3 (building the global type environment) must be sequential because it aggregates type information from all modules. Phase 5 (code generation) is also sequential because it writes to a shared bytecode buffer with offsets that depend on the previous module's output.

On an 8-core machine, this reduced compilation time for a 50-file project from 340 milliseconds to 95 milliseconds -- a 3.6x improvement. The speedup is not 8x because phases 3 and 5 are sequential, following Amdahl's law.
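Amdahl's law makes that ceiling easy to estimate. In the sketch below, the ~82% parallel fraction is our assumption fitted to the measured 3.6x, not a profiled number:

```rust
// Speedup predicted by Amdahl's law for a given parallel fraction and core count.
fn amdahl_speedup(parallel_fraction: f64, cores: f64) -> f64 {
    1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)
}

fn main() {
    // Assumed: ~82% of the 340 ms pipeline (phases 2 and 4) parallelizes.
    // On 8 cores that predicts roughly the measured 3.6x -- never 8x,
    // because the sequential phases 3 and 5 set a hard floor.
    let s = amdahl_speedup(0.82, 8.0);
    assert!((s - 3.54).abs() < 0.05);
    println!("predicted speedup: {:.2}x", s);
}
```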

Runtime Optimization: Bytecode Dispatch

The FLIN VM executes bytecode instructions in a dispatch loop. The original implementation used a match statement:

```rust
// Before: match-based dispatch
loop {
    let instruction = self.bytecode[self.ip];
    self.ip += 1;

    match instruction {
        Op::LoadConst(idx) => { /* ... */ }
        Op::Add => { /* ... */ }
        Op::Call(func_id) => { /* ... */ }
        // ... 120+ opcodes
    }
}
```

A match with 120+ arms compiles to a jump table, which is reasonably fast. But the CPU's branch predictor struggles with indirect jumps through a jump table because the target depends on the runtime value of the instruction.

We switched to a function pointer dispatch table -- Rust's closest analogue to the computed-goto technique used by CPython, LuaJIT, and other high-performance interpreters:

```rust
type OpcodeHandler = fn(&mut VM) -> Result<(), RuntimeError>;

static DISPATCH_TABLE: [OpcodeHandler; 128] = [
    VM::op_load_const,   // 0
    VM::op_load_local,   // 1
    VM::op_store_local,  // 2
    VM::op_add,          // 3
    VM::op_subtract,     // 4
    VM::op_multiply,     // 5
    VM::op_divide,       // 6
    VM::op_call,         // 7
    // ... 120 more handlers
];

impl VM {
    pub fn execute(&mut self) -> Result<Value, RuntimeError> {
        loop {
            let opcode = self.bytecode[self.ip] as usize;
            self.ip += 1;

            if opcode >= DISPATCH_TABLE.len() {
                return Err(RuntimeError::new(
                    "InvalidOpcode",
                    &format!("Unknown opcode: {}", opcode),
                    self.current_span(),
                ));
            }

            DISPATCH_TABLE[opcode](self)?;

            if self.should_return {
                break;
            }
        }

        self.pop()
    }
}
```

The function pointer table approach has two advantages: each handler is a separate function that the CPU can predict independently, and the table lookup is a single indexed memory access. In our benchmarks, this improved bytecode execution speed by 15-20% across all workloads.
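The shape of the technique is easier to see in a self-contained toy VM (illustrative opcodes and names, not FLIN's):

```rust
// Toy stack VM with table-based dispatch: each opcode is a separate handler
// function, and the main loop is one indexed load plus one indirect call.
struct Vm {
    code: Vec<u8>,
    ip: usize,
    stack: Vec<i64>,
}

type Handler = fn(&mut Vm);

fn op_push1(vm: &mut Vm) { vm.stack.push(1); }
fn op_add(vm: &mut Vm) {
    let b = vm.stack.pop().unwrap();
    let a = vm.stack.pop().unwrap();
    vm.stack.push(a + b);
}
fn op_halt(vm: &mut Vm) { vm.ip = vm.code.len(); }

static TABLE: [Handler; 3] = [op_push1, op_add, op_halt];

fn main() {
    // Program: push1, push1, add, halt  ->  stack ends as [2]
    let mut vm = Vm { code: vec![0, 0, 1, 2], ip: 0, stack: Vec::new() };
    while vm.ip < vm.code.len() {
        let op = vm.code[vm.ip] as usize;
        vm.ip += 1;
        TABLE[op](&mut vm); // single indexed load + indirect call per instruction
    }
    assert_eq!(vm.stack, vec![2]);
    println!("result: {:?}", vm.stack);
}
```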

Runtime Optimization: Entity Query Compilation

FLIN's query builder (Entity.where(field == value).order(field).limit(n)) was previously interpreted at runtime -- each method call walked the AST of the filter expression and built a query plan dynamically. For queries that execute repeatedly (in loops, in templates, in API endpoints), this interpretation overhead was significant.

We added query compilation: the first time a query pattern is encountered, it is compiled into an optimized scan function. Subsequent executions use the compiled function directly.

```rust
pub struct CompiledQuery {
    filter: Box<dyn Fn(&Entity) -> bool + Send + Sync>,
    sort_key: Option<Box<dyn Fn(&Entity) -> Value + Send + Sync>>,
    sort_order: SortOrder,
    limit: Option<usize>,
    offset: usize,
}

impl CompiledQuery {
    pub fn execute(&self, entities: &[Entity]) -> Vec<&Entity> {
        let mut results: Vec<&Entity> = entities
            .iter()
            .filter(|e| (self.filter)(e))
            .collect();

        if let Some(ref sort_key) = self.sort_key {
            results.sort_by(|a, b| {
                let ka = sort_key(a);
                let kb = sort_key(b);
                match self.sort_order {
                    SortOrder::Asc => ka.cmp(&kb),
                    SortOrder::Desc => kb.cmp(&ka),
                }
            });
        }

        results
            .into_iter()
            .skip(self.offset)
            .take(self.limit.unwrap_or(usize::MAX))
            .collect()
    }
}
```

Query compilation reduced the latency of entity list operations by 40% -- from 18 ms P50 to 11 ms P50 for 100-item queries against a 10,000-entity table.
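The core idea -- pay the interpretation cost once, then reuse a closure -- can be sketched in a few lines (illustrative types, not FLIN's actual query builder):

```rust
// Stand-in entity type, for illustration only.
struct Entity { age: i64 }

// "Compiling" a filter: any AST walking happens once, here; the returned
// closure is pure execution with no per-call interpretation overhead.
fn compile_filter(min_age: i64) -> Box<dyn Fn(&Entity) -> bool> {
    Box::new(move |e| e.age >= min_age)
}

fn main() {
    let entities = vec![Entity { age: 17 }, Entity { age: 30 }, Entity { age: 42 }];
    let filter = compile_filter(18); // compiled once
    // Reused across executions with no AST walk per entity.
    let adults: Vec<&Entity> = entities.iter().filter(|e| filter(e)).collect();
    assert_eq!(adults.len(), 2);
    println!("{} adults", adults.len());
}
```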

The Results

After all Phase 3 optimizations, the benchmark numbers:

| Operation | Before P50 | After P50 | Improvement |
|---|---|---|---|
| Simple page render | 12 ms | 4 ms | 3x faster |
| Entity list (100 items) | 18 ms | 7 ms | 2.6x faster |
| Full-text search | 25 ms | 9 ms | 2.8x faster |
| Hybrid search | 38 ms | 14 ms | 2.7x faster |
| Compilation (50 files) | 340 ms | 95 ms | 3.6x faster |

| Metric | Before | After | Improvement |
|---|---|---|---|
| Value size | 72 bytes | 24 bytes | 67% smaller |
| Entity memory (100 items) | 4.8 MB | 1.9 MB | 60% less |
| Idle memory (server running) | 28 MB | 12 MB | 57% less |

Every single operation in the benchmark suite improved by at least 2.5x. Memory usage dropped by more than half. And none of these optimizations changed the semantics of the language or the behavior of the runtime -- they are purely internal improvements, invisible to the FLIN developer.

The Three Phases Complete

Production hardening across Sessions 244, 245, and 246 transformed FLIN from a working prototype into a system suitable for production deployment:

  • Phase 1 (Stability): The system does not crash. Errors are caught, logged, and returned as meaningful responses. 35 of 47 crash vectors eliminated.
  • Phase 2 (Reliability): The system maintains consistent state. Operations are atomic, the WAL recovers gracefully, and foreign keys are enforced across transactions.
  • Phase 3 (Performance): The system is fast. Memory usage halved, compilation 3.6x faster, runtime operations 2.5-3x faster.

These three properties -- stability, reliability, performance -- are the non-negotiable requirements for production software. Features attract users. These properties keep them.

---

This is Part 183 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.

Series Navigation:
- [182] Production Hardening Phase 2: Reliability
- [183] Production Hardening Phase 3: Performance (you are here)
- [184] MVP Status Review: What's Ready and What's Not
