Embedding models have context windows. Many popular open-source models accept 512 tokens; some accept 8,192. None accept 50,000. When a user uploads a 50-page PDF, you cannot embed the entire document as a single vector. The embedding model would truncate the text silently, and the resulting vector would represent only the first few pages -- making the rest of the document invisible to semantic search.
Chunking solves this by splitting documents into pieces that fit within the embedding model's context window. But chunking is not just "split every N characters." Bad chunking destroys meaning. Split a sentence in half and neither half makes sense to an embedding model. Split in the middle of a code block and you lose the function signature that gives the body its meaning.
Session 221 implemented FLIN's chunking module: the bridge between document extraction (which produces raw text) and embedding generation (which requires bounded-length inputs). Twenty tests, 400 lines of Rust, and two chunking strategies that handle everything from legal contracts to source code.
The Document Intelligence Pipeline
Chunking sits at stage 3 of FLIN's document intelligence pipeline:
1. Upload File -> body.file (PDF, DOCX, etc.)
2. Extract Text -> document_extract(file) -> raw text
3. Chunk Text -> chunk_text(text, options) -> [Chunk]
4. Embed Chunks -> embed(chunk.text) -> vector
5. Store Vectors -> VectorStore.store_embedding()
6. Semantic Search -> search "query" in Entity by field

Stages 1 and 2 were complete. Stages 4 and 5 existed from the embedding system built earlier. Stage 3 was the missing piece. Without chunking, the only way to embed a document was to truncate it to fit the model's context window -- losing most of the content.
The Chunk Struct
Every chunk carries metadata about its position in the original document:
```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Chunk {
    pub text: String,        // The chunk content
    pub index: usize,        // Position in the chunk sequence (0, 1, 2, ...)
    pub start_offset: usize, // Character offset in original text (start, inclusive)
    pub end_offset: usize,   // Character offset in original text (end, exclusive)
    pub total_chunks: usize, // Total number of chunks
}
```

The offsets enable source attribution. When semantic search returns a chunk, the application can highlight the exact passage in the original document. The index and total count enable display logic like "showing result from chunk 7 of 23."
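Offset-based attribution can be sketched with a tiny standalone helper. The `passage` function below is hypothetical (it is not part of FLIN's API shown in this article), but it illustrates how the character offsets recover the original passage:

```rust
// Hypothetical helper (not part of FLIN): recover the exact passage a chunk
// came from, using its character offsets into the original document.
fn passage(original: &str, start: usize, end: usize) -> String {
    // Operate on chars, not bytes, to match the chunker's offset semantics.
    original.chars().skip(start).take(end - start).collect()
}

fn main() {
    let doc = "First part. Second part.";
    // A chunk with start_offset = 12, end_offset = 24:
    assert_eq!(passage(doc, 12, 24), "Second part.");
}
```

Note that the helper skips by characters rather than slicing bytes, matching the character-based offsets the chunker produces.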
Recursive Character Chunking
The primary chunking strategy is recursive character chunking. It splits text at character boundaries with configurable chunk size and overlap:
```rust
pub struct ChunkOptions {
    pub chunk_size: usize,                 // Target chunk size in characters
    pub overlap: usize,                    // Overlap between consecutive chunks
    pub respect_word_boundaries: bool,     // Split at word boundaries
    pub respect_sentence_boundaries: bool, // Split at sentence ends
}

impl ChunkOptions {
    pub fn new(chunk_size: usize, overlap: usize) -> Self {
        Self {
            chunk_size,
            overlap,
            respect_word_boundaries: true,
            respect_sentence_boundaries: false,
        }
    }
}
```
The default configuration is 1,000 characters per chunk with 200 characters of overlap. These numbers come from empirical observation across different embedding models:
- 1,000 characters is approximately 250 tokens, well within the 512-token window of models like all-MiniLM-L6-v2.
- 200 characters of overlap (20%) ensures that concepts split across chunk boundaries are still captured in at least one chunk.
Why Characters, Not Tokens
Chunk size is measured in characters, not tokens. This is a deliberate choice. Token counts depend on the tokenizer, which depends on the model. A chunk of 1,000 characters might be 230 tokens with one model and 280 with another. Character-based sizing is model-independent and produces consistent behavior regardless of which embedding model is configured.
```rust
pub fn chunk_text(text: &str, options: &ChunkOptions) -> Vec<Chunk> {
    if text.is_empty() {
        return vec![];
    }

    let char_count = text.chars().count();
    if char_count <= options.chunk_size {
        return vec![Chunk {
            text: text.to_string(),
            index: 0,
            start_offset: 0,
            end_offset: char_count,
            total_chunks: 1,
        }];
    }

    let mut chunks = Vec::new();
    let mut start = 0;
    let chars: Vec<char> = text.chars().collect();

    while start < chars.len() {
        let mut end = (start + options.chunk_size).min(chars.len());

        // Respect word boundaries: backtrack to last space
        if options.respect_word_boundaries && end < chars.len() {
            if let Some(space_pos) = find_last_space(&chars, start, end) {
                end = space_pos + 1;
            }
        }

        let chunk_text: String = chars[start..end].iter().collect();
        chunks.push(Chunk {
            text: chunk_text,
            index: chunks.len(),
            start_offset: start,
            end_offset: end,
            total_chunks: 0, // Set after all chunks are created
        });

        // Advance with overlap
        start = if end >= chars.len() {
            chars.len()
        } else {
            end - options.overlap.min(end - start)
        };
    }

    // Set total_chunks on all chunks
    let total = chunks.len();
    for chunk in &mut chunks {
        chunk.total_chunks = total;
    }

    chunks
}
```
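The stepping rule can be checked in isolation. The sketch below is a minimal standalone reduction (not FLIN's code, and it ignores word boundaries): each chunk starts `chunk_size - overlap` characters after the previous one, and the final chunk simply runs to the end of the text.

```rust
// Minimal sketch of the overlap stepping rule (word boundaries ignored):
// each chunk starts `chunk_size - overlap` characters after the previous one.
fn chunk_starts(len: usize, chunk_size: usize, overlap: usize) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut start = 0;
    while start < len {
        starts.push(start);
        let end = (start + chunk_size).min(len);
        if end >= len {
            break; // last chunk reaches the end of the text
        }
        start = end - overlap;
    }
    starts
}

fn main() {
    // 26 characters, chunk_size 10, overlap 2:
    // the chunks cover 0..10, 8..18, 16..26
    assert_eq!(chunk_starts(26, 10, 2), vec![0, 8, 16]);
}
```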
UTF-8 Safety
The function uses chars().count() instead of len() and indexes into a Vec&lt;char&gt; rather than byte slices. This is critical for multilingual content. A French document with accented characters, a Japanese document with multi-byte kanji, or a document with emoji all chunk correctly because the algorithm operates on Unicode code points, not bytes.

A byte-based approach would split in the middle of a multi-byte character, producing invalid UTF-8. The character-based approach costs a small performance penalty (collecting into a Vec&lt;char&gt;) but guarantees correctness across all languages.
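The divergence between bytes and characters is easy to demonstrate with standard library calls:

```rust
// Character counts vs byte lengths for multilingual text.
fn main() {
    let s = "héllo 世界 🌍";
    assert_eq!(s.chars().count(), 10); // 10 Unicode code points
    assert_eq!(s.len(), 18);           // 18 bytes in UTF-8
    // An arbitrary byte offset can fall inside a multi-byte character;
    // slicing there (e.g. &s[..2]) would panic at runtime.
    assert!(!s.is_char_boundary(2)); // byte 2 is inside 'é'
}
```

A chunker that advanced by byte offsets would hit exactly these mid-character positions.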
Sentence-Boundary Chunking
The second strategy splits at sentence boundaries. This produces chunks that are more semantically coherent because each chunk contains complete sentences:
```rust
pub fn chunk_by_sentences(text: &str, options: &ChunkOptions) -> Vec<Chunk> {
    let sentences = split_into_sentences(text);
    let mut chunks = Vec::new();
    let mut current_text = String::new();
    let mut current_start = 0;

    for sentence in &sentences {
        if current_text.chars().count() + sentence.chars().count() > options.chunk_size
            && !current_text.is_empty()
        {
            chunks.push(Chunk {
                text: current_text.trim().to_string(),
                index: chunks.len(),
                start_offset: current_start,
                end_offset: current_start + current_text.chars().count(),
                total_chunks: 0,
            });
            current_start += current_text.chars().count() - options.overlap;
            current_text = String::new();
        }
        current_text.push_str(sentence);
    }

    // Handle remaining text
    if !current_text.trim().is_empty() {
        chunks.push(Chunk {
            text: current_text.trim().to_string(),
            index: chunks.len(),
            start_offset: current_start,
            end_offset: current_start + current_text.chars().count(),
            total_chunks: 0,
        });
    }

    let total = chunks.len();
    for chunk in &mut chunks {
        chunk.total_chunks = total;
    }

    chunks
}
```
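The article does not show `split_into_sentences`; a minimal hedged sketch (hypothetical, not FLIN's implementation) would split after terminal punctuation:

```rust
// Hypothetical sketch of a sentence splitter: break after '.', '!', or '?'.
// Real sentence detection must also handle abbreviations, decimals, etc.
fn split_into_sentences(text: &str) -> Vec<String> {
    let mut sentences = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        current.push(c);
        if matches!(c, '.' | '!' | '?') {
            sentences.push(current.trim_start().to_string());
            current.clear();
        }
    }
    if !current.trim().is_empty() {
        sentences.push(current.trim_start().to_string());
    }
    sentences
}

fn main() {
    let s = split_into_sentences("One. Two! Three?");
    assert_eq!(s, vec!["One.", "Two!", "Three?"]);
}
```

Even this naive version shows why the strategy suits prose: each returned piece is a complete thought that can accumulate into a chunk without mid-sentence cuts.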
Sentence chunking is better for prose -- articles, legal documents, essays -- where meaning is organized in sentences and paragraphs. Character chunking is better for unstructured text or text where sentence detection is unreliable (log files, CSV data, code).
Overlap: Why It Matters
Consider a document about photosynthesis that says: "Chlorophyll absorbs red and blue light. This absorbed energy drives the conversion of carbon dioxide and water into glucose."
If the chunk boundary falls between these two sentences with no overlap, a search for "how chlorophyll produces glucose" might miss the connection because no single chunk contains both "chlorophyll" and "glucose." With overlap, the second chunk includes the first sentence as context, preserving the relationship.
```
// Without overlap: search might miss connections
chunk_1: "...Chlorophyll absorbs red and blue light."
chunk_2: "This absorbed energy drives the conversion..."

// With 200-char overlap: connections preserved
chunk_1: "...Chlorophyll absorbs red and blue light."
chunk_2: "...absorbs red and blue light. This absorbed energy drives..."
```
The trade-off is storage. A 200-character overlap means consecutive chunks advance by only 800 characters instead of 1,000, producing roughly 25% more chunks and 25% more embeddings. For most applications, this cost is negligible compared to the quality improvement in search results.
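The storage arithmetic can be made concrete with a small sketch (a simplification that ignores word-boundary backtracking):

```rust
// Quick arithmetic for the overlap storage cost: with chunk_size 1000 and
// overlap 200, consecutive chunks advance by 800 characters, so a document
// needs roughly len/800 chunks instead of len/1000.
fn approx_chunks(len: usize, chunk_size: usize, overlap: usize) -> usize {
    let step = chunk_size - overlap;
    (len + step - 1) / step // ceiling division
}

fn main() {
    // A 100,000-character document:
    assert_eq!(approx_chunks(100_000, 1000, 0), 100);   // no overlap
    assert_eq!(approx_chunks(100_000, 1000, 200), 125); // ~25% more chunks
}
```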
Estimating Tokens and Optimal Chunk Size
The module includes helpers for adapting chunk size to specific embedding models:
```rust
pub fn estimate_tokens(text: &str) -> usize {
    // Approximate: 1 token per 4 characters for English
    // This is a rough estimate; actual token count depends on the tokenizer
    (text.chars().count() + 3) / 4
}

pub fn optimal_chunk_size(model_context_window: usize) -> usize {
    // Use 80% of context window to leave room for overhead
    let target_tokens = (model_context_window as f64 * 0.8) as usize;
    // Convert tokens to characters (4 chars per token)
    target_tokens * 4
}
```
These are approximations. The 4-characters-per-token ratio is a reasonable average for English text with common tokenizers (GPT-style BPE, SentencePiece). For languages with more complex morphology or non-Latin scripts, the ratio can vary significantly. The 80% utilization factor leaves headroom for tokenizer-specific overhead and special tokens.
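The arithmetic can be checked by restating the two helpers verbatim and running them on concrete inputs:

```rust
// Restated from the module above so the arithmetic can be checked in isolation.
fn estimate_tokens(text: &str) -> usize {
    (text.chars().count() + 3) / 4 // ceiling of chars / 4
}

fn optimal_chunk_size(model_context_window: usize) -> usize {
    let target_tokens = (model_context_window as f64 * 0.8) as usize;
    target_tokens * 4
}

fn main() {
    assert_eq!(estimate_tokens("hello world"), 3); // 11 chars -> 3 tokens
    // A 512-token window: 80% = 409 tokens -> 1,636 characters
    assert_eq!(optimal_chunk_size(512), 1636);
}
```

Note that the 1,636-character result for a 512-token window comfortably contains the 1,000-character default chunk size, which is the headroom the defaults rely on.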
Default Configuration
For developers who do not want to think about chunk parameters, FLIN provides sensible defaults:
```rust
pub fn chunk_text_default(text: &str) -> Vec<Chunk> {
    chunk_text(text, &ChunkOptions::new(1000, 200))
}
```

One function call, no configuration. The defaults work well for English text with embedding models in the 384-to-1024 dimension range. For specialized use cases -- very short documents, very long documents, non-English text -- the full ChunkOptions API is available.
Connecting to the Pipeline
In FLIN application code, chunking is typically invisible. The semantic text field type triggers automatic chunking and embedding when a document is saved:
```
entity KnowledgeArticle {
    title: text
    content: semantic text
}

article = KnowledgeArticle.create({
    title: "Introduction to SYSCOHADA",
    content: extracted_text  // From document_extract()
})

save article  // Automatically: chunk -> embed -> store vectors
```
Behind the scenes, the runtime chunks the content, generates an embedding for each chunk, and stores the embeddings in the vector store with field names like content__chunk_0, content__chunk_1, and so on. When a semantic search query arrives, the search system queries across all chunks and returns results with source attribution back to the original entity.
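The per-chunk field naming is straightforward to sketch; the helper below is hypothetical (FLIN's internal naming code is not shown in this article), but it reproduces the content__chunk_N pattern described above:

```rust
// Hypothetical helper reproducing the field__chunk_N naming pattern.
fn chunk_field_name(field: &str, index: usize) -> String {
    format!("{}__chunk_{}", field, index)
}

fn main() {
    assert_eq!(chunk_field_name("content", 0), "content__chunk_0");
    assert_eq!(chunk_field_name("content", 7), "content__chunk_7");
}
```

Deriving the name from the field plus the chunk index is what lets a search hit map back to both the owning entity and the exact chunk that matched.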
The chunking module was built in 25 minutes during Session 221. Twenty tests verify every edge case: empty text, text shorter than chunk size, Unicode text, paragraph text, overlap behavior, and metadata consistency. The module is small (400 lines) but critical -- it is the bridge that makes semantic search work on real documents, not just toy examples.
In the next article, we wire the chunking module to the embedding system, completing the pipeline from raw document bytes to searchable vectors.
---
This is Part 130 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.
Series Navigation:

- [129] Download Grants and Access Keys
- [130] Text Chunking Strategies (you are here)
- [131] Chunk-Embedding Integration