A lexer is the first thing your code meets. It turns characters into meaning. Before a parser can understand your program's structure, before a type checker can verify its correctness, before a code generator can emit bytecode -- the lexer must read your source file one character at a time and produce a stream of tokens. It is the humblest phase of a compiler, and arguably the most important. If the lexer gets a token wrong, every subsequent phase inherits the error.
FLIN's lexer has one unusual challenge that most language lexers do not face: it must handle two fundamentally different syntactic modes in the same source file. FLIN programs contain imperative code -- variable declarations, control flow, function calls -- interleaved with HTML-like view declarations. The lexer must seamlessly switch between these modes without losing its place, misidentifying tokens, or confusing a less-than operator with an opening HTML tag.
This is the story of how we built that lexer in Rust, from the first Peekable to the final test passing.
---
The Scanner Struct
Every lexer is, at its core, a cursor moving through a character stream. FLIN's scanner holds six pieces of state:
```rust
pub struct Lexer<'a> {
    source: &'a str,
    chars: Peekable<CharIndices<'a>>,
    line: u32,
    column: u32,
    start: u32,
    mode: LexerMode,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum LexerMode {
    Code,           // Normal code
    View,           // Inside HTML-like tags
    ViewExpression, // Inside {expression} in view
}
```
The source field holds a reference to the original source text. We never copy it -- the lexer borrows it for its entire lifetime, and every lexeme is produced by slicing into this original string. This is Rust's ownership model working for us: the borrow checker guarantees that the source string lives at least as long as the lexer, so our slices are always valid.
The chars field is a Peekable -- Rust's standard library iterator that yields (byte_offset, char) pairs. The Peekable wrapper lets us look one character ahead without consuming it, which is essential for disambiguating multi-character tokens like == versus =.
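As a minimal, self-contained sketch of that lookahead (the helper names here are illustrative, not the real Lexer API), peeking lets us decide between `==` and `=` without consuming the next character:

```rust
use std::iter::Peekable;
use std::str::CharIndices;

// Peek one character ahead without consuming it, to tell `==` from `=`.
fn scan_equals(chars: &mut Peekable<CharIndices<'_>>) -> &'static str {
    match chars.peek() {
        Some(&(_, '=')) => {
            chars.next(); // consume the second '='
            "EqualEqual"
        }
        _ => "Equal",
    }
}

// Toy driver: classify the first operator in `src`.
fn lex_first_op(src: &str) -> &'static str {
    let mut chars = src.char_indices().peekable();
    match chars.next() {
        Some((_, '=')) => scan_equals(&mut chars),
        _ => "Other",
    }
}
```

If the peeked character is not `=`, nothing is consumed, so the next call to the scanner sees it fresh.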
The line and column fields track the current position for error messages. The start field marks the byte offset where the current token began -- when we finish scanning a token, we slice source[start..current] to get the lexeme.
And then there is mode. This is the field that distinguishes FLIN's lexer from a textbook implementation.
---
The Tokenization Loop
The main loop is deceptively simple:
```rust
pub fn tokenize(&mut self) -> Result<Vec<Token>, LexError> {
    let mut tokens = Vec::new();

    while let Some(token) = self.next_token()? {
        tokens.push(token);
    }

    tokens.push(Token {
        kind: TokenKind::Eof,
        span: self.current_span(),
        lexeme: String::new(),
    });

    Ok(tokens)
}
```
Call next_token() in a loop. Push each token onto a vector. When the source is exhausted, append an Eof token. Return the vector.
The Eof token is not just a sentinel -- it is a contract with the parser. The parser can always call self.peek() without checking for an empty token stream, because there is always at least one token: Eof. This eliminates bounds-checking conditionals throughout the parser, which makes the parsing code cleaner and harder to break.
---
Character-by-Character Dispatch
The next_token() method is where the real work happens. After skipping whitespace and comments, it reads one character and dispatches:
```rust
fn next_token(&mut self) -> Result<Option<Token>, LexError> {
    self.skip_whitespace();
    self.skip_comments();
    self.start = self.current_offset();

    match self.advance() {
        None => Ok(None),
        Some((_, c)) => {
            let kind = match c {
                // Single-character tokens
                '(' => TokenKind::LeftParen,
                ')' => TokenKind::RightParen,
                '{' => self.handle_left_brace(),
                '}' => self.handle_right_brace(),
                '[' => TokenKind::LeftBracket,
                ']' => TokenKind::RightBracket,
                ',' => TokenKind::Comma,
                ':' => TokenKind::Colon,
                ';' => TokenKind::Semicolon,
                '.' => TokenKind::Dot,
                '@' => TokenKind::At,

                // Multi-character tokens
                '+' => self.match_char('+', TokenKind::PlusPlus, TokenKind::Plus),
                '-' => self.match_char('-', TokenKind::MinusMinus, TokenKind::Minus),
                '=' => self.match_char('=', TokenKind::EqualEqual, TokenKind::Equal),
                '!' => self.match_char('=', TokenKind::NotEqual, TokenKind::Not),
                '<' => self.scan_tag_or_less(),
                '>' => self.match_char('=', TokenKind::GreaterEqual, TokenKind::TagClose),
                '&' => self.expect_char('&', TokenKind::And)?,
                '|' => self.expect_char('|', TokenKind::Or)?,

                // Literals
                '"' => self.scan_string()?,
                c if c.is_ascii_digit() => self.scan_number()?,
                c if c.is_alphabetic() || c == '_' => self.scan_identifier(),

                _ => return Err(LexError::UnexpectedCharacter(c, self.current_position())),
            };

            Ok(Some(Token {
                kind,
                span: self.current_span(),
                lexeme: self.current_lexeme(),
            }))
        }
    }
}
```
This is a textbook scanner dispatch, with one critical exception: the < character. In most languages, < is unambiguously a less-than operator. In FLIN, it might be a less-than operator, a less-than-or-equal operator (<=), or the start of an HTML-like tag. Resolving this ambiguity is the lexer's hardest job.
---
The Modal Lexer: Code, View, and ViewExpression
FLIN source files look like this:
```
count = 0                  // code mode

<button click={count++}>   // view mode
  {count}                  // view-expression mode
</button>
```
The lexer must handle three distinct contexts:
Code mode is the default. Tokens are operators, keywords, identifiers, and literals. The < character is a comparison operator.
View mode activates when the lexer encounters a < followed by an alphabetic character (indicating a tag name) or a / (indicating a closing tag). In view mode, the lexer emits TagOpen, TagName, AttrName, TagClose, TagSelfClose, and TagEnd tokens instead of arithmetic and comparison operators.
ViewExpression mode activates when the lexer encounters a { while in view mode. Inside the braces, the lexer reverts to code-mode tokenization -- expressions like count++ are scanned as identifiers and operators, not as tag content. When the matching } is found, the lexer returns to view mode.
The mode transitions are handled by three methods:
```rust
fn scan_tag_or_less(&mut self) -> TokenKind {
    if self.peek_is_alpha() || self.peek_char() == Some('/') {
        self.mode = LexerMode::View;
        TokenKind::TagOpen
    } else if self.match_next('=') {
        TokenKind::LessEqual
    } else {
        TokenKind::Less
    }
}

fn handle_left_brace(&mut self) -> TokenKind {
    if self.mode == LexerMode::View {
        self.mode = LexerMode::ViewExpression;
    }
    TokenKind::LeftBrace
}

fn handle_right_brace(&mut self) -> TokenKind {
    if self.mode == LexerMode::ViewExpression {
        self.mode = LexerMode::View;
    }
    TokenKind::RightBrace
}
```
The scan_tag_or_less method is the critical disambiguation point. When the lexer sees <, it peeks at the next character. If it is alphabetic or a / (a closing tag), this is a tag -- switch to view mode and emit TagOpen. If the next character is =, this is <= -- emit LessEqual. Otherwise, it is a plain < -- emit Less.
This lookahead is why the lexer uses Peekable instead of a raw iterator. The peek costs nothing -- it does not advance the cursor -- and it resolves the ambiguity in constant time.
---
Scanning Strings
String scanning is straightforward but requires careful handling of escape sequences:
FLIN supports standard escape sequences: \n (newline), \t (tab), \\ (backslash), \" (double quote). The scanner reads characters until it hits an unescaped double quote or the end of the file. If the file ends before the string is closed, the lexer returns a LexError::UnterminatedString with the position where the string started -- not where the file ended. This distinction matters for error messages: "unterminated string starting at line 7, column 12" is actionable. "unexpected end of file at line 94" is not.
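A minimal sketch of that escape handling, under simplifying assumptions: positions are omitted and a plain error string stands in for the LexError variants, so this is the shape of the logic rather than the real scan_string:

```rust
// Hedged sketch: scan a string literal, given the characters *after* the
// opening quote. The real lexer reports the start Position on error.
fn scan_string(rest: &mut std::str::Chars<'_>) -> Result<String, &'static str> {
    let mut out = String::new();
    loop {
        match rest.next() {
            // File ended before the closing quote.
            None => return Err("unterminated string"),
            // Unescaped closing quote: the literal is complete.
            Some('"') => return Ok(out),
            // Escape sequences: \n, \t, \\, \"
            Some('\\') => match rest.next() {
                Some('n') => out.push('\n'),
                Some('t') => out.push('\t'),
                Some('\\') => out.push('\\'),
                Some('"') => out.push('"'),
                _ => return Err("invalid escape"),
            },
            Some(c) => out.push(c),
        }
    }
}
```

Because the loop only terminates on a closing quote or end of input, an unescaped quote inside the literal can never be swallowed by mistake.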
---
Scanning Numbers
Number scanning has two phases. First, consume all ASCII digits (with optional underscore separators for readability: 1_000_000). Then check for a dot followed by more digits to distinguish integers from floats.
The parsed value is stored directly in the token: TokenKind::Integer(i64) or TokenKind::Float(f64). This means the parser never has to re-parse a number literal from its string representation. The lexer does it once, correctly, and every subsequent phase uses the parsed value.
FLIN also supports hexadecimal (0xFF) and binary (0b1010) integer literals. The lexer detects the 0x or 0b prefix and switches to the appropriate digit set. If a hex literal contains non-hex characters, or a binary literal contains digits other than 0 and 1, the lexer produces a specific error: LexError::InvalidHexLiteral or LexError::InvalidBinaryLiteral. Specific error types mean specific error messages, and specific error messages mean developers fix their code faster.
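The prefix detection and underscore stripping can be sketched as follows -- a simplified stand-in, not the real scan_number, with error variants that mirror the article's names but carry no Position:

```rust
// Hedged sketch: integer scanning with underscore separators and 0x/0b prefixes.
#[derive(Debug, PartialEq)]
enum NumError {
    InvalidHex,
    InvalidBinary,
}

fn scan_int(src: &str) -> Result<i64, NumError> {
    // Pick the digit set from the prefix, and the error to report on failure.
    let (digits, radix, err) = if let Some(hex) = src.strip_prefix("0x") {
        (hex, 16, NumError::InvalidHex)
    } else if let Some(bin) = src.strip_prefix("0b") {
        (bin, 2, NumError::InvalidBinary)
    } else {
        (src, 10, NumError::InvalidHex) // decimal errors are handled elsewhere in the real lexer
    };
    // Underscores are readability separators: strip them before parsing.
    let cleaned: String = digits.chars().filter(|&c| c != '_').collect();
    i64::from_str_radix(&cleaned, radix).map_err(|_| err)
}
```

`from_str_radix` rejects any character outside the digit set, which is exactly the "non-hex characters in a hex literal" case described above.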
---
Scanning Identifiers and Keywords
This is the method we defined in the previous article, but it is worth examining the flow:
1. The lexer encounters an alphabetic character or underscore.
2. It consumes all subsequent alphanumeric characters and underscores.
3. It extracts the lexeme by slicing the source string.
4. It checks the lexeme against 42 keyword strings.
5. If a keyword matches, it returns TokenKind::Keyword(Keyword::...).
6. Otherwise, it returns TokenKind::Identifier(lexeme).
The keyword check is a match on &str, which Rust compiles into an efficient lookup. For 42 keywords, this is fast enough that a perfect hash function would not produce a measurable improvement on any realistic source file.
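The six steps above can be sketched like this -- a self-contained stand-in with only three of the 42 keywords shown, and a simplified TokenKind:

```rust
// Hedged sketch: consume an identifier, then check it against keywords.
#[derive(Debug, PartialEq)]
enum TokenKind {
    Keyword(&'static str),
    Identifier(String),
}

// Returns the token and the byte offset where the lexeme ends.
fn scan_identifier(src: &str) -> (TokenKind, usize) {
    // Step 2: consume alphanumerics and underscores.
    let end = src
        .char_indices()
        .find(|&(_, c)| !(c.is_alphanumeric() || c == '_'))
        .map_or(src.len(), |(i, _)| i);
    // Step 3: slice the lexeme out of the source.
    let lexeme = &src[..end];
    // Steps 4-6: match against the keyword table (truncated here).
    let kind = match lexeme {
        "entity" => TokenKind::Keyword("entity"),
        "view" => TokenKind::Keyword("view"),
        "if" => TokenKind::Keyword("if"),
        _ => TokenKind::Identifier(lexeme.to_string()),
    };
    (kind, end)
}
```

The keyword match compiles to a cheap comparison chain over string lengths and bytes, which is why a perfect hash buys nothing at this scale.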
One subtlety: the lexeme for an identifier is a String, not a &str. This means the lexer allocates a new string for every identifier token. We could have used string interning (a global table that deduplicates strings) to reduce allocations. We chose not to, for two reasons. First, the lexer runs once per compilation, and string allocation is not the bottleneck in a compiler that also parses, type-checks, and generates bytecode. Second, interning adds global state, which complicates testing and makes the lexer harder to reason about. Simplicity won.
---
Whitespace and Comments
The skip_whitespace() method consumes spaces, tabs, and carriage returns without producing tokens. Newlines, however, are treated differently depending on the mode.
In code mode, newlines are significant -- FLIN uses newlines as statement terminators in many contexts, similar to Go or Python. The lexer emits a TokenKind::Newline for each newline character, which the parser uses to determine statement boundaries.
In view mode, whitespace (including newlines) between tags is collapsed into text content, following HTML conventions. The modal design ensures that the same whitespace character is handled differently depending on context.
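The mode-dependent newline handling can be sketched as follows (an illustrative fragment, assuming a simplified two-variant mode enum rather than the real LexerMode):

```rust
// Hedged sketch: skip spaces, tabs, and carriage returns. In code mode, stop
// at '\n' so the caller can emit a Newline token; in view mode, consume it.
#[derive(Clone, Copy, PartialEq)]
enum Mode {
    Code,
    View,
}

fn skip_whitespace(it: &mut std::iter::Peekable<std::str::Chars<'_>>, mode: Mode) {
    while let Some(&c) = it.peek() {
        match c {
            ' ' | '\t' | '\r' => {
                it.next();
            }
            '\n' if mode == Mode::View => {
                it.next(); // collapsed into text content, HTML-style
            }
            _ => break, // in code mode, '\n' is left for the tokenizer
        }
    }
}
```

The same character stream thus yields a Newline token in one mode and nothing in the other, which is the whole point of the modal design.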
Comments use the // prefix and extend to the end of the line. The lexer skips them entirely -- they produce no tokens. Block comments (`/* ... */`) are not supported in FLIN. This was a deliberate design choice: block comments add complexity to the lexer (they can nest, they can run to the end of the file if left unclosed, they interact poorly with string literals) and their absence has never been a practical problem in languages like Python or Ruby that also lack them.
---
Error Handling
The lexer can fail in four ways:
1. Unexpected character: a character that does not match any scanning rule. Unicode outside of string literals, for example.
2. Unterminated string: a double quote that is never closed.
3. Invalid number literal: a hex prefix followed by non-hex digits, or a number with multiple decimal points.
4. Expected character not found: the lexer sees & and expects another & (to form &&), but the next character is something else.
Each error variant carries a Position, so the error message can point to the exact location in the source file. This is a small investment that pays dividends: a compiler with good error messages is a compiler that developers will actually use.
```rust
pub enum LexError {
    UnexpectedCharacter(char, Position),
    UnterminatedString(Position),
    InvalidHexLiteral(Position),
    InvalidBinaryLiteral(Position),
    ExpectedCharacter { expected: char, found: Option<char>, position: Position },
}
```

We chose to make LexError a proper enum rather than returning string messages. This means downstream code can match on error variants and handle them differently -- an IDE plugin might want to auto-fix an unterminated string, for instance, while an unexpected character just gets a red underline. Structured errors are more work to define, but they compose better than strings.
---
Testing the Lexer
By session 4, the lexer had 97 unit tests. These fell into several categories:
Single-token tests: verify that each individual token type is scanned correctly. "+" produces Plus. "==" produces EqualEqual. "entity" produces Keyword(Keyword::Entity). These are boring but essential -- they ensure the dispatch table is wired correctly.
Multi-token tests: verify that sequences of tokens are scanned in the right order. "x = 42" produces [Identifier("x"), Equal, Integer(42), Eof]. These catch interaction bugs where scanning one token leaves the cursor in the wrong position for the next.
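A test of that shape, written against a toy tokenizer rather than the real Lexer (whose full API is not reproduced here), looks like this:

```rust
// Hedged sketch: a toy tokenizer just rich enough for the "x = 42" test.
#[derive(Debug, PartialEq)]
enum Tok {
    Identifier(String),
    Equal,
    Integer(i64),
    Eof,
}

fn tokenize(src: &str) -> Vec<Tok> {
    let mut toks = Vec::new();
    let mut it = src.char_indices().peekable();
    while let Some(&(i, c)) = it.peek() {
        if c.is_whitespace() {
            it.next();
        } else if c == '=' {
            it.next();
            toks.push(Tok::Equal);
        } else if c.is_ascii_digit() {
            let mut end = i;
            while let Some(&(k, d)) = it.peek() {
                if d.is_ascii_digit() { end = k + 1; it.next(); } else { break; }
            }
            toks.push(Tok::Integer(src[i..end].parse().unwrap()));
        } else if c.is_alphabetic() || c == '_' {
            let mut end = i;
            while let Some(&(k, d)) = it.peek() {
                if d.is_alphanumeric() || d == '_' { end = k + d.len_utf8(); it.next(); } else { break; }
            }
            toks.push(Tok::Identifier(src[i..end].to_string()));
        } else {
            it.next(); // toy lexer: ignore anything else
        }
    }
    toks.push(Tok::Eof);
    toks
}
```

Asserting on the full token vector, Eof included, is what catches the cursor-position bugs: a token that consumes one character too many shifts every subsequent token in the list.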
Mode transition tests: verify that the lexer switches between code, view, and view-expression modes correctly. `<button click={x}>` must produce [TagOpen, Identifier("button"), Identifier("click"), Equal, LeftBrace, Identifier("x"), RightBrace, TagClose] with the correct mode transitions at each brace.
Error tests: verify that the lexer produces the right error for malformed input. An unterminated string must return UnterminatedString, not a panic or a wrong token.
The 97 tests compiled and passed by the end of session 4. When we started building the parser in session 5, the token stream was trustworthy. We never had to debug a parser bug that turned out to be a lexer bug. That is the return on investment of thorough lexer testing.
---
Performance Characteristics
FLIN's lexer is single-pass and O(n) in the length of the source file. Each character is examined at most twice: once by peek() and once by advance(). The only allocation is the token vector itself and the String fields inside identifier and string literal tokens.
For a typical FLIN source file of a few hundred lines, lexing takes microseconds. This is not because we optimized aggressively -- it is because we avoided unnecessary work. No regular expressions. No backtracking. No multi-pass scanning. Just a cursor, a match, and a push.
If FLIN ever needs to lex files large enough for performance to matter (which seems unlikely for a domain-specific language), the first optimization would be string interning, followed by arena allocation for the token vector. But those are bridges we will cross if we reach them.
---
Lessons Learned
Three lessons from building the lexer:
The modal design was the right call. We considered alternatives: scanning the entire file in code mode and post-processing view sections, or using a separate lexer for view content. Both alternatives would have complicated the parser. The modal lexer produces a unified token stream that the parser can consume without knowing about modes. The complexity stays in one place -- the three mode-transition methods -- instead of leaking across the entire compilation pipeline.
Carry parsed values in tokens. Storing Integer(42) instead of Integer("42") moves parsing work from the parser to the lexer. This is a net win because the lexer has the best context for parsing literals (it knows exactly where the literal starts and ends) and the parser can treat all literals uniformly without type-specific parsing logic.
Test tokens exhaustively before building the parser. The 97 lexer tests took a few hours to write across sessions 1 through 4. They prevented an unknown but certainly nonzero number of parser debugging sessions. In a compiler, bugs in early phases masquerade as bugs in later phases. Testing early phases exhaustively is the cheapest way to keep later phases debuggable.
---
What Came Next
With the lexer producing a clean, well-tested token stream, we were ready for the parser. But FLIN's parser is not a simple recursive descent parser -- it uses a Pratt parser for expression handling, which gives us operator precedence without a grammar table and extensibility without rewriting the parsing loop.
The next article explains how Pratt parsing works, why we chose it for FLIN, and how it handles everything from simple arithmetic to temporal access operators and AI intent queries.
---
This is Part 12 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO built a programming language from scratch.
Series Navigation: - [11] Session 1: Project Setup and 42 Keywords - [12] Building a Lexer From Scratch in Rust (you are here) - [13] Pratt Parsing: How FLIN Reads Your Code - [14] The Abstract Syntax Tree: FLIN's Internal Representation - [15] Hindley-Milner Type Inference in a Custom Language