A system that does not crash (Phase 1) and maintains consistent state (Phase 2) still has one critical requirement for production: it must be fast enough. Not fast in benchmarks -- fast in the context that matters. A web application that takes 800 milliseconds to render a page is functionally broken, even if it never crashes and never loses data.
Session 246 focused on the performance dimension of production readiness. We profiled FLIN under realistic workloads, identified bottlenecks, and systematically eliminated them. The work fell into three categories: memory optimization, compilation speed, and runtime execution.
## Profiling Methodology
Before optimizing anything, we needed data. We constructed a benchmark application that exercises every major FLIN subsystem: entity CRUD, template rendering, search (BM25 and semantic), cron jobs, file uploads, and WebSocket connections. The application was seeded with 10,000 entities across 5 entity types, with full-text search indexes and foreign key relationships.
We used three profiling tools:
- `perf` for CPU profiling -- identifying which functions consume the most execution time.
- `heaptrack` for memory profiling -- tracking every allocation and identifying leaks.
- Rust's built-in `std::time::Instant` for end-to-end latency measurement on specific code paths.
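As a concrete sketch of the `Instant`-based approach (the `measure` helper here is illustrative, not FLIN's actual harness):

```rust
use std::time::Instant;

// Hypothetical helper: runs a closure and returns its result together
// with the elapsed wall-clock time in microseconds.
fn measure<T>(f: impl FnOnce() -> T) -> (T, u128) {
    let start = Instant::now();
    let result = f();
    (result, start.elapsed().as_micros())
}

fn main() {
    let (sum, micros) = measure(|| (0u64..1_000_000).sum::<u64>());
    println!("sum = {}, took {} us", sum, micros);
}
```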
The baseline numbers before any optimization:
| Operation | P50 Latency | P99 Latency | Memory |
|---|---|---|---|
| Simple page render | 12 ms | 45 ms | 2.1 MB |
| Entity list (100 items) | 18 ms | 72 ms | 4.8 MB |
| Full-text search | 25 ms | 110 ms | 6.2 MB |
| Hybrid search | 38 ms | 180 ms | 9.4 MB |
| Compilation (50 files) | 340 ms | -- | 48 MB |
These numbers are acceptable for a development environment but unacceptable for production. A P99 of 180 milliseconds for hybrid search means that one in a hundred requests takes nearly a fifth of a second just for the search -- before template rendering, before HTTP serialization, before network transfer.
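For reference, P50 and P99 are order statistics over a latency sample. A minimal nearest-rank sketch (the method FLIN's harness actually uses may differ):

```rust
// Nearest-rank percentile over a latency sample, in milliseconds.
// A minimal sketch; other interpolation schemes give slightly different values.
fn percentile(samples: &mut Vec<u64>, p: f64) -> u64 {
    samples.sort_unstable();
    // Nearest-rank: ceil(p/100 * N), converted to a zero-based index.
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1)]
}

fn main() {
    let samples: Vec<u64> = (1..=100).collect(); // 1 ms .. 100 ms
    let p50 = percentile(&mut samples.clone(), 50.0);
    let p99 = percentile(&mut samples.clone(), 99.0);
    println!("P50 = {} ms, P99 = {} ms", p50, p99);
}
```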
## Memory Optimization: String Interning
The profiling data revealed that string allocation dominated memory usage. FLIN entity field names, entity type names, and commonly used string values were being allocated as separate String instances on every access. In a page that renders 100 entities, the string "User" was allocated 100 times, the string "name" was allocated 100 times, the string "email" was allocated 100 times.
String interning eliminates this redundancy by storing each unique string once and using lightweight references everywhere else:
```rust
use std::collections::HashSet;
use std::sync::Arc;

pub struct StringInterner {
    strings: HashSet<Arc<str>>,
}

impl StringInterner {
    pub fn new() -> Self {
        Self { strings: HashSet::new() }
    }

    pub fn intern(&mut self, s: &str) -> Arc<str> {
        // Return the existing allocation if we have seen this string before.
        if let Some(existing) = self.strings.get(s) {
            return Arc::clone(existing);
        }
        // First occurrence: allocate once and remember it.
        let arc: Arc<str> = Arc::from(s);
        self.strings.insert(Arc::clone(&arc));
        arc
    }
}
```
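A condensed, free-function version of the same idea shows the payoff: interning the same string twice yields two handles to one allocation (names here are illustrative):

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Minimal interning sketch: a HashSet<Arc<str>> acts as the string pool.
fn intern(pool: &mut HashSet<Arc<str>>, s: &str) -> Arc<str> {
    if let Some(existing) = pool.get(s) {
        return Arc::clone(existing);
    }
    let arc: Arc<str> = Arc::from(s);
    pool.insert(Arc::clone(&arc));
    arc
}

fn main() {
    let mut pool = HashSet::new();
    let a = intern(&mut pool, "User");
    let b = intern(&mut pool, "User");
    // Two handles, one allocation: no duplicate "User" strings in memory.
    assert!(Arc::ptr_eq(&a, &b));
    assert_eq!(pool.len(), 1);
}
```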
We interned entity type names, field names, decorator names, and built-in string values ("none", "true", "false", common error messages). The result was dramatic:
| Scenario | Before | After | Reduction |
|---|---|---|---|
| 100 User entities loaded | 4.8 MB | 1.9 MB | 60% |
| 1,000 entities (mixed types) | 38 MB | 14 MB | 63% |
| Template with 50 variables | 2.1 MB | 0.9 MB | 57% |
String interning reduced memory usage by approximately 60% for entity-heavy workloads. This is one of those optimizations that costs very little in implementation complexity but pays enormous dividends.
## Memory Optimization: Value Representation
FLIN's Value enum -- the runtime representation of every value in the language -- was 72 bytes. For integers, floats, booleans, and none values, this was wasteful. A 64-bit integer does not need 72 bytes of storage.
The original representation:
```rust
// Before: 72 bytes per Value
pub enum Value {
    None,
    Bool(bool),
    Int(i64),
    Float(f64),
    Text(String),                // 24 bytes (ptr + len + capacity)
    List(Vec<Value>),            // 24 bytes
    Map(HashMap<String, Value>), // 48 bytes
    Entity(Box<Entity>),         // 8 bytes (pointer)
}
```

Rust enums are sized to their largest variant. The Map variant with its HashMap forced every Value to be 72 bytes, even a simple Value::Int(42).
We restructured the representation using boxing for large variants:
```rust
// After: 24 bytes per Value
pub enum Value {
    None,
    Bool(bool),
    Int(i64),
    Float(f64),
    Text(Arc<str>),                     // 8 bytes (pointer)
    List(Arc<Vec<Value>>),              // 8 bytes (pointer)
    Map(Arc<HashMap<Arc<str>, Value>>), // 8 bytes (pointer)
    Entity(Arc<Entity>),                // 8 bytes (pointer)
}
```

By wrapping the large variants in Arc (an atomically reference-counted pointer), every Value is now 24 bytes -- the largest inline payload (an i64, f64, or pointer at 8 bytes) plus the discriminant and padding. The Arc also enables cheap cloning: copying a Value::List with 10,000 elements is a pointer copy and an atomic increment, not a deep copy.
This change reduced the memory footprint of the VM's value stack by 67% and made value passing between functions essentially free.
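The cheap-clone property is easy to demonstrate with a trimmed-down stand-in for the Value enum (variants reduced for the sketch):

```rust
use std::sync::Arc;

// Simplified stand-in for FLIN's Value enum, with most variants omitted.
#[derive(Clone)]
enum Value {
    Int(i64),
    List(Arc<Vec<Value>>),
}

fn main() {
    let items: Arc<Vec<Value>> = Arc::new((0i64..10_000).map(Value::Int).collect());
    let big = Value::List(Arc::clone(&items));

    // Cloning copies a pointer and bumps a refcount; the 10,000 elements
    // are shared, not duplicated.
    let copy = big.clone();
    assert_eq!(Arc::strong_count(&items), 3); // items + big + copy

    drop(copy);
    assert_eq!(Arc::strong_count(&items), 2);
}
```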
## Compilation Speed: Parallel Module Compilation
FLIN applications are organized into files: `app/index.flin`, `app/api/users.flin`, `app/admin/dashboard.flin`, and so on. Before Phase 3, these files were compiled sequentially -- each file was lexed, parsed, type-checked, and code-generated before the next file started.
For a 50-file application, this meant the compilation pipeline was serialized across 50 files, even though most files are independent. A user page does not depend on an admin page. An API endpoint does not depend on a template component.
We introduced parallel module compilation using Rust's rayon crate:
```rust
use rayon::prelude::*;

pub fn compile_project(project_dir: &Path) -> Result<Bytecode, CompileError> {
    // Phase 1: Discover all source files in the project
    let files = discover_source_files(project_dir)?;

    // Phase 2: Lex and parse all files in parallel
    let parsed_modules: Vec<ParsedModule> = files
        .par_iter()
        .map(|path| parse_file(path))
        .collect::<Result<_, _>>()?;

    // Phase 3: Build the global type environment (sequential)
    let mut type_env = TypeEnvironment::new();
    for module in &parsed_modules {
        type_env.register_module_exports(&module.ast)?;
    }

    // Phase 4: Type-check all modules in parallel
    let checked_modules: Vec<CheckedModule> = parsed_modules
        .par_iter()
        .map(|module| type_check(module, &type_env))
        .collect::<Result<_, _>>()?;

    // Phase 5: Code generation (sequential -- writes to shared bytecode buffer)
    let mut codegen = CodeGenerator::new();
    for module in &checked_modules {
        codegen.generate_module(module)?;
    }

    Ok(codegen.finalize())
}
```
Phases 2 and 4 (parsing and type-checking) run in parallel across all CPU cores. Phase 3 (building the global type environment) must be sequential because it aggregates type information from all modules. Phase 5 (code generation) is also sequential because it writes to a shared bytecode buffer with offsets that depend on the previous module's output.
On an 8-core machine, this reduced compilation time for a 50-file project from 340 milliseconds to 95 milliseconds -- a 3.6x improvement. The speedup is not 8x because phases 3 and 5 are sequential, following Amdahl's law.
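The Amdahl arithmetic checks out. A small sketch (the parallel fraction of 0.825 below is derived by solving for the observed 3.6x, not a measured value from the article):

```rust
// Amdahl's law: speedup = 1 / ((1 - p) + p / n), where p is the parallel
// fraction of the work and n the number of cores.
fn amdahl_speedup(p: f64, n: f64) -> f64 {
    1.0 / ((1.0 - p) + p / n)
}

fn main() {
    // Solving 3.6 = 1 / ((1 - p) + p / 8) gives p ~= 0.825: roughly 83% of
    // the pipeline (parsing + type-checking) parallelizes, the rest is the
    // sequential type-environment and codegen phases.
    let p = 0.825;
    println!("predicted speedup on 8 cores: {:.2}x", amdahl_speedup(p, 8.0));
    // Even with infinite cores, the sequential phases cap the speedup.
    println!("upper bound: {:.2}x", amdahl_speedup(p, 1e12));
}
```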
## Runtime Optimization: Bytecode Dispatch
The FLIN VM executes bytecode instructions in a dispatch loop. The original implementation used a match statement:
```rust
// Before: match-based dispatch
loop {
    let instruction = self.bytecode[self.ip];
    self.ip += 1;

    match instruction {
        Op::LoadConst(idx) => { /* ... */ }
        Op::Add => { /* ... */ }
        Op::Call(func_id) => { /* ... */ }
        // ... 120+ opcodes
    }
}
```
A match with 120+ arms compiles to a jump table, which is reasonably fast. But the CPU's branch predictor struggles with indirect jumps through a jump table because the target depends on the runtime value of the instruction.
We switched to a function pointer dispatch table -- the closest safe-Rust analogue of the computed-goto technique used by CPython, LuaJIT, and other high-performance interpreters:
```rust
type OpcodeHandler = fn(&mut VM) -> Result<(), RuntimeError>;

static DISPATCH_TABLE: [OpcodeHandler; 128] = [
    VM::op_load_const,  // 0
    VM::op_load_local,  // 1
    VM::op_store_local, // 2
    VM::op_add,         // 3
    VM::op_subtract,    // 4
    VM::op_multiply,    // 5
    VM::op_divide,      // 6
    VM::op_call,        // 7
    // ... 120 more handlers
];

impl VM {
    pub fn execute(&mut self) -> Result<Value, RuntimeError> {
        loop {
            let opcode = self.bytecode[self.ip] as usize;
            self.ip += 1;

            if opcode >= DISPATCH_TABLE.len() {
                return Err(RuntimeError::new(
                    "InvalidOpcode",
                    &format!("Unknown opcode: {}", opcode),
                    self.current_span(),
                ));
            }

            DISPATCH_TABLE[opcode](self)?;

            if self.should_return {
                break;
            }
        }

        self.pop()
    }
}
```
The function pointer table approach has two advantages: each handler is a separate function that the CPU can predict independently, and the table lookup is a single indexed memory access. In our benchmarks, this improved bytecode execution speed by 15-20% across all workloads.
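A toy interpreter in the same style shows the pattern end to end (the opcodes and handlers here are invented for illustration):

```rust
// Toy dispatch-table interpreter: each opcode indexes into a table of
// function pointers. Opcodes: 0 = push constant 1, 1 = add top two values.
struct Vm {
    stack: Vec<i64>,
    code: Vec<u8>,
    ip: usize,
}

fn op_push1(vm: &mut Vm) {
    vm.stack.push(1);
}

fn op_add(vm: &mut Vm) {
    let b = vm.stack.pop().unwrap();
    let a = vm.stack.pop().unwrap();
    vm.stack.push(a + b);
}

static TABLE: [fn(&mut Vm); 2] = [op_push1, op_add];

fn run(vm: &mut Vm) -> i64 {
    while vm.ip < vm.code.len() {
        let opcode = vm.code[vm.ip] as usize;
        vm.ip += 1;
        TABLE[opcode](vm); // single indexed load + indirect call
    }
    vm.stack.pop().unwrap()
}

fn main() {
    // push1, push1, add
    let mut vm = Vm { stack: vec![], code: vec![0, 0, 1], ip: 0 };
    assert_eq!(run(&mut vm), 2);
}
```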
## Runtime Optimization: Entity Query Compilation
FLIN's query builder (`Entity.where(field == value).order(field).limit(n)`) was previously interpreted at runtime -- each execution walked the AST of the filter expression and built a query plan dynamically. For queries that execute repeatedly (in loops, in templates, in API endpoints), this interpretation overhead was significant.
We added query compilation: the first time a query pattern is encountered, it is compiled into an optimized scan function. Subsequent executions use the compiled function directly.
```rust
pub struct CompiledQuery {
    filter: Box<dyn Fn(&Entity) -> bool + Send + Sync>,
    sort_key: Option<Box<dyn Fn(&Entity) -> Value + Send + Sync>>,
    sort_order: SortOrder,
    limit: Option<usize>,
    offset: usize,
}

impl CompiledQuery {
    pub fn execute(&self, entities: &[Entity]) -> Vec<&Entity> {
        let mut results: Vec<&Entity> = entities
            .iter()
            .filter(|e| (self.filter)(e))
            .collect();

        if let Some(ref sort_key) = self.sort_key {
            results.sort_by(|a, b| {
                let ka = sort_key(a);
                let kb = sort_key(b);
                match self.sort_order {
                    SortOrder::Asc => ka.cmp(&kb),
                    SortOrder::Desc => kb.cmp(&ka),
                }
            });
        }

        results
            .into_iter()
            .skip(self.offset)
            .take(self.limit.unwrap_or(usize::MAX))
            .collect()
    }
}
```
Query compilation reduced the latency of entity list operations by 40% -- from 18 ms P50 to 11 ms P50 for 100-item queries against a 10,000-entity table.
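A usage sketch of the compiled-query pattern, with a simplified record type standing in for Entity (all names here are illustrative, not FLIN's actual API):

```rust
// Simplified record type standing in for FLIN's Entity.
struct User {
    age: i64,
}

// Trimmed-down compiled query: the filter closure is built once and
// reused on every execution, with no per-call AST walking.
struct CompiledQuery {
    filter: Box<dyn Fn(&User) -> bool>,
    limit: Option<usize>,
}

impl CompiledQuery {
    fn execute<'a>(&self, users: &'a [User]) -> Vec<&'a User> {
        users
            .iter()
            .filter(|u| (self.filter)(u))
            .take(self.limit.unwrap_or(usize::MAX))
            .collect()
    }
}

fn main() {
    // Compiled once, executed many times.
    let q = CompiledQuery {
        filter: Box::new(|u| u.age >= 18),
        limit: Some(2),
    };
    let users = vec![User { age: 15 }, User { age: 21 }, User { age: 34 }, User { age: 40 }];
    let adults = q.execute(&users);
    assert_eq!(adults.len(), 2);
}
```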
## The Results
After all Phase 3 optimizations, the benchmark numbers:
| Operation | Before P50 | After P50 | Improvement |
|---|---|---|---|
| Simple page render | 12 ms | 4 ms | 3x faster |
| Entity list (100 items) | 18 ms | 7 ms | 2.6x faster |
| Full-text search | 25 ms | 9 ms | 2.8x faster |
| Hybrid search | 38 ms | 14 ms | 2.7x faster |
| Compilation (50 files) | 340 ms | 95 ms | 3.6x faster |
| Metric | Before | After | Improvement |
|---|---|---|---|
| Value size | 72 bytes | 24 bytes | 67% smaller |
| Entity memory (100 items) | 4.8 MB | 1.9 MB | 60% less |
| Idle memory (server running) | 28 MB | 12 MB | 57% less |
Every single operation in the benchmark suite improved by at least 2.5x. Memory usage dropped by more than half. And none of these optimizations changed the semantics of the language or the behavior of the runtime -- they are purely internal improvements, invisible to the FLIN developer.
## The Three Phases Complete
Production hardening across Sessions 244, 245, and 246 transformed FLIN from a working prototype into a system suitable for production deployment:
- Phase 1 (Stability): The system does not crash. Errors are caught, logged, and returned as meaningful responses. 35 of 47 crash vectors eliminated.
- Phase 2 (Reliability): The system maintains consistent state. Operations are atomic, the WAL recovers gracefully, and foreign keys are enforced across transactions.
- Phase 3 (Performance): The system is fast. Memory usage halved, compilation 3.6x faster, runtime operations 2.5-3x faster.
These three properties -- stability, reliability, performance -- are the non-negotiable requirements for production software. Features attract users. These properties keep them.
---
This is Part 183 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.
Series Navigation:
- [182] Production Hardening Phase 2: Reliability
- [183] Production Hardening Phase 3: Performance (you are here)
- [184] MVP Status Review: What's Ready and What's Not