RAG is only as good as the data it retrieves from. If your knowledge base is a collection of PDF files, Word documents, and CSV spreadsheets, you need to extract the text before you can embed and search it. In a typical RAG stack, this requires separate parsing libraries: pdfjs-dist for PDFs, mammoth for DOCX, csv-parse for CSV, each with its own API and its own edge cases.
FLIN provides a single parse_document() function that handles all common document formats. Give it a file, get back structured text. No library selection. No format detection. No encoding guessing.
The parse_document() Function
```
// Parse any supported format
result = parse_document("path/to/document.pdf")

// Result structure
result.text      // Full extracted text
result.metadata  // Document metadata (title, author, pages, etc.)
result.format    // Detected format ("pdf", "docx", "csv", etc.)
result.pages     // Array of page texts (for PDF)
result.sections  // Array of sections with headings (for DOCX)
result.rows      // Array of row objects (for CSV)
```
The return type adapts to the document format while maintaining a common text field that always contains the full extracted text. This text field is what you typically pass to the embedding pipeline.
PDF Parsing
PDF is the most complex format to parse. Text can be in arbitrary positions, fonts can encode characters in non-standard ways, and content can be in images rather than text streams.
```
result = parse_document("annual-report.pdf")

// Full text
full_text = result.text

// Per-page access
for page in result.pages {
  log_info("Page {page.number}: {page.text.len} characters")
}

// Metadata
title = result.metadata.title        // "2025 Annual Report"
author = result.metadata.author      // "ZeroSuite Inc."
page_count = result.metadata.pages   // 47
```
The FLIN PDF parser handles:

- Text extraction from text streams (the most common case).
- Basic table detection -- tabular data is extracted with column alignment preserved.
- Multi-column layouts -- text is reordered into reading order.
- Encoding normalization -- non-standard font encodings are mapped to Unicode.
```rust
pub fn parse_pdf(path: &Path) -> Result<ParsedDocument, ParseError> {
    let doc = pdf::file::File::open(path)?;
    let mut pages = Vec::new();
    let mut full_text = String::new();

    for page_num in 0..doc.num_pages() {
        let page = doc.get_page(page_num)?;
        let text = extract_page_text(&page)?;
        full_text.push_str(&text);
        full_text.push('\n');
        pages.push(PageContent {
            number: page_num + 1,
            text,
        });
    }

    let metadata = extract_pdf_metadata(&doc);

    Ok(ParsedDocument {
        text: full_text.trim().to_string(),
        format: "pdf".into(),
        metadata,
        pages: Some(pages),
        ..Default::default()
    })
}
```
DOCX Parsing
Word documents store content in XML inside a ZIP archive. FLIN extracts text, headings, and basic formatting:
```
result = parse_document("proposal.docx")

// Structured sections
for section in result.sections {
  log_info("## {section.heading}")
  log_info(section.text)
}

// Tables
for table in result.tables {
  for row in table.rows {
    log_info(row.join(" | "))
  }
}
```
The parser preserves document structure:

- Headings are extracted with their level (H1, H2, H3).
- Paragraphs are separated by newlines.
- Lists are extracted with bullet/number prefixes.
- Tables are extracted as arrays of arrays.
- Images are noted but not OCR-processed (text content only).
```rust
pub fn parse_docx(path: &Path) -> Result<ParsedDocument, ParseError> {
    let file = std::fs::File::open(path)?;
    let mut archive = zip::ZipArchive::new(file)?;

    let document_xml = archive.by_name("word/document.xml")?;
    let doc = parse_xml(document_xml)?;

    let mut sections = Vec::new();
    let mut current_heading = String::new();
    let mut current_text = String::new();
    let mut full_text = String::new();

    for element in doc.body.children {
        match element {
            Element::Paragraph(p) if p.is_heading() => {
                if !current_text.is_empty() {
                    sections.push(Section {
                        heading: current_heading.clone(),
                        text: current_text.trim().to_string(),
                    });
                }
                current_heading = p.text();
                current_text = String::new();
                full_text.push_str(&format!("\n## {}\n", current_heading));
            }
            Element::Paragraph(p) => {
                let text = p.text();
                current_text.push_str(&text);
                current_text.push('\n');
                full_text.push_str(&text);
                full_text.push('\n');
            }
            Element::Table(t) => {
                let table_text = format_table(&t);
                current_text.push_str(&table_text);
                full_text.push_str(&table_text);
            }
            _ => {}
        }
    }

    // Flush the final section accumulated after the last heading
    if !current_text.is_empty() {
        sections.push(Section {
            heading: current_heading,
            text: current_text.trim().to_string(),
        });
    }

    Ok(ParsedDocument {
        text: full_text.trim().to_string(),
        format: "docx".into(),
        sections: Some(sections),
        ..Default::default()
    })
}
```
CSV Parsing
CSV files are parsed into structured rows with automatic header detection:
```
result = parse_document("products.csv")

// Access as structured rows
for row in result.rows {
  log_info("{row.name}: ${row.price}")
}

// Full text (all rows concatenated)
full_text = result.text

// Headers
headers = result.metadata.columns   // ["name", "price", "category", ...]
row_count = result.metadata.rows    // 1523
```
The CSV parser handles:

- Header detection -- the first row is used as column names.
- Quoted fields -- fields containing commas or newlines within quotes.
- Encoding detection -- UTF-8, UTF-16, and Latin-1 are auto-detected.
- Delimiter detection -- commas, semicolons, tabs, and pipes.
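Delimiter detection usually works by sampling the first few lines, counting each candidate delimiter, and preferring the one that appears consistently on every line. A Python sketch of that idea (an approximation, not FLIN's actual implementation):

```python
def detect_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Pick the delimiter with the highest, most consistent per-line count."""
    lines = [l for l in sample.splitlines() if l.strip()][:10]
    best, best_score = ",", -1.0
    for d in candidates:
        counts = [line.count(d) for line in lines]
        if not counts or min(counts) == 0:
            continue  # a real delimiter appears on every sampled line
        # Equal counts on every line strongly suggest a tabular delimiter
        consistency = 1.0 if len(set(counts)) == 1 else 0.5
        score = consistency * sum(counts) / len(counts)
        if score > best_score:
            best, best_score = d, score
    return best

print(detect_delimiter("name;price\nwidget;9.99\ngadget;12.50"))  # ;
```

The consistency bonus matters: a comma inside one quoted field should not outvote a semicolon that cleanly separates every row.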
JSON and YAML Parsing
Structured data formats are parsed and flattened into text for embedding:
```
// JSON
result = parse_document("config.json")
data = result.data   // Parsed JSON as FLIN value

// YAML
result = parse_document("playbook.yaml")
data = result.data   // Parsed YAML as FLIN value

// Both provide a text representation for embedding
full_text = result.text   // Flattened key-value text
```
The text representation flattens nested structures into readable key-value pairs:
```
{"user": {"name": "Thales", "email": "[email protected]", "roles": ["admin", "dev"]}}
```

Becomes:

```
user.name: Thales
user.email: [email protected]
user.roles: admin, dev
```

This flattened format produces better embeddings than the raw JSON because it is closer to natural language.
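The flattening rule is straightforward to sketch: walk the structure recursively, join nested keys with dots, and join scalar lists with commas. A Python approximation (not FLIN's actual code):

```python
import json

def flatten(value, prefix=""):
    """Flatten nested dicts/lists into 'dotted.key: value' lines."""
    lines = []
    if isinstance(value, dict):
        for key, val in value.items():
            path = f"{prefix}.{key}" if prefix else key
            lines.extend(flatten(val, path))
    elif isinstance(value, list) and all(not isinstance(v, (dict, list)) for v in value):
        # Lists of scalars collapse to a comma-separated line
        lines.append(f"{prefix}: {', '.join(str(v) for v in value)}")
    elif isinstance(value, list):
        # Lists of objects get positional keys
        for i, val in enumerate(value):
            lines.extend(flatten(val, f"{prefix}.{i}"))
    else:
        lines.append(f"{prefix}: {value}")
    return lines

doc = json.loads('{"user": {"name": "Thales", "roles": ["admin", "dev"]}}')
print("\n".join(flatten(doc)))
# user.name: Thales
# user.roles: admin, dev
```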
The Document Ingestion Pipeline
Parsing is typically the first step in a document ingestion pipeline:
```
// app/api/documents/upload.flin
guard auth
guard role("admin")

route POST {
  validate {
    file: file @required @max_size("50MB") @allow_types(
      "application/pdf",
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "text/csv",
      "application/json",
      "text/yaml"
    )
    title: text @required
    category: text
  }

  // Store the file
  file_path = save_file(body.file, ".flindb/documents/")

  // Parse the document
  parsed = parse_document(file_path)

  // Chunk for embedding
  chunks = chunk_text(parsed.text, {
    max_size: 500,
    overlap: 50,
    strategy: "paragraph"
  })

  // Save document metadata
  doc = Document {
    title: body.title,
    file_path: file_path,
    format: parsed.format,
    page_count: parsed.metadata.pages || 0,
    category: body.category || "general",
    full_text: parsed.text
  }
  save doc

  // Save chunks with embeddings
  for chunk in chunks {
    save DocumentChunk {
      document_id: doc.id,
      content: chunk.text,   // semantic text
      page: chunk.page || 0,
      position: chunk.position
    }
  }

  response {
    status: 201
    body: {
      id: doc.id,
      title: doc.title,
      format: parsed.format,
      chunks: chunks.len,
      pages: parsed.metadata.pages || 0
    }
  }
}
```
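The chunk_text call in the pipeline packs paragraphs into size-bounded chunks with a small overlap between neighbors. A rough Python equivalent of those assumed semantics (not FLIN's implementation):

```python
def chunk_text(text: str, max_size: int = 500, overlap: int = 50) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_size characters,
    carrying the last `overlap` characters into the next chunk for context."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_size:
            chunks.append(current)
            current = current[-overlap:]  # overlap tail seeds the next chunk
        current = f"{current}\n\n{para}".strip() if current else para
    if current:
        chunks.append(current)
    return chunks
```

The overlap means context that straddles a chunk boundary still appears, at least partially, in both chunks, which reduces retrieval misses at boundaries.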
Upload a file. Parse it. Chunk it. Embed and index the chunks. Search across all uploaded documents. The entire pipeline is 40 lines of FLIN code.
Error Handling for Malformed Documents
Real-world documents are frequently malformed: password-protected PDFs, corrupted DOCX files, CSV files with inconsistent column counts. The parser handles these gracefully:
```
result = parse_document("maybe-corrupted.pdf")

if result.error != none {
  log_warn("Parse error: {result.error}")

  // Fall back to raw text extraction
  result = parse_document("maybe-corrupted.pdf", { fallback: "raw" })
}
```
The fallback: "raw" option extracts whatever text can be found without strict format parsing. For damaged PDFs, this often recovers most of the readable content.
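A raw fallback can be as simple as scanning the file's bytes for runs of printable text, skipping binary structure entirely. This Python sketch is an assumption about how such a fallback might work, not FLIN's code:

```python
import re

def raw_text_fallback(data: bytes, min_run: int = 4) -> str:
    """Recover readable ASCII runs from a damaged file's raw bytes."""
    decoded = data.decode("latin-1")  # lossless byte-to-char mapping
    # Keep runs of printable ASCII at least min_run characters long
    runs = re.findall(rf"[ -~]{{{min_run},}}", decoded)
    return "\n".join(runs)
```

Short runs are dropped because binary data produces many accidental two- and three-character "words" that would pollute the extracted text.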
Supported Formats Summary
| Format | Extensions | Parser | Notes |
|---|---|---|---|
| PDF | .pdf | Custom Rust | Text, tables, metadata |
| DOCX | .docx | ZIP + XML | Headings, paragraphs, tables |
| CSV | .csv, .tsv | Custom | Auto-detect delimiter |
| JSON | .json | serde_json | Flattened to text |
| YAML | .yaml, .yml | serde_yaml | Flattened to text |
| Plain text | .txt, .md | Direct | No parsing needed |
| HTML | .html, .htm | Tag stripping | Text content extracted |
Format detection uses the file extension first, then falls back to content-type sniffing for ambiguous cases.
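Extension-first detection with a magic-byte fallback might look like the following Python sketch (the magic numbers shown are the standard ones; the function itself is illustrative):

```python
import os

MAGIC_BYTES = {
    b"%PDF": "pdf",
    b"PK\x03\x04": "docx",  # ZIP container; DOCX is ZIP + XML
    b"{": "json",
}

KNOWN_EXTENSIONS = {"pdf", "docx", "csv", "tsv", "json",
                    "yaml", "yml", "txt", "md", "html", "htm"}

def detect_format(path: str, head: bytes) -> str:
    """Prefer the file extension; fall back to sniffing the leading bytes."""
    ext = os.path.splitext(path)[1].lower().lstrip(".")
    if ext in KNOWN_EXTENSIONS:
        return ext
    for magic, fmt in MAGIC_BYTES.items():
        if head.startswith(magic):
            return fmt
    return "txt"  # treat unknown content as plain text
```

Sniffing only runs for ambiguous cases, so a correctly named file never pays the cost of reading and matching magic bytes.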
Document parsing is the unglamorous foundation of any RAG system. Without clean, structured text extraction, embeddings are poor and search results are irrelevant. FLIN's built-in parsers ensure that the text entering the embedding pipeline is clean, structured, and faithful to the original document.
In the next article, we explore code-aware chunking -- how FLIN splits documents into embedding-appropriate chunks while respecting semantic boundaries.
---
This is Part 121 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.
Series Navigation: - [120] RAG: Retrieval, Reranking, and Source Attribution - [121] Document Parsing: PDF, DOCX, CSV, JSON, YAML (you are here) - [122] Code-Aware Chunking for RAG - [123] Hybrid Document Search: BM25 + Semantic