RAG is only as good as the data it retrieves from. If your knowledge base is a collection of PDF files, Word documents, and CSV spreadsheets, you need to extract the text before you can embed and search it. In a typical RAG stack, this requires separate parsing libraries: pdfjs-dist for PDFs, mammoth for DOCX, csv-parse for CSV, each with its own API and its own edge cases.
FLIN provides a single parse_document() function that handles all common document formats. Give it a file, get back structured text. No library selection. No format detection. No encoding guessing.
The parse_document() Function
```
// Parse any supported format
result = parse_document("path/to/document.pdf")

// Result structure
result.text      // Full extracted text
result.metadata  // Document metadata (title, author, pages, etc.)
result.format    // Detected format ("pdf", "docx", "csv", etc.)
result.pages     // Array of page texts (for PDF)
result.sections  // Array of sections with headings (for DOCX)
result.rows      // Array of row objects (for CSV)
```
The return type adapts to the document format while maintaining a common text field that always contains the full extracted text. This text field is what you typically pass to the embedding pipeline.
PDF Parsing
PDF is the most complex format to parse. Text can be in arbitrary positions, fonts can encode characters in non-standard ways, and content can be in images rather than text streams.
```
result = parse_document("annual-report.pdf")

// Full text
full_text = result.text

// Per-page access
for page in result.pages {
  log_info("Page {page.number}: {page.text.len} characters")
}

// Metadata
title = result.metadata.title        // "2025 Annual Report"
author = result.metadata.author      // "ZeroSuite Inc."
page_count = result.metadata.pages   // 47
```
The FLIN PDF parser handles:

- Text extraction from text streams (the most common case).
- Basic table detection -- tabular data is extracted with column alignment preserved.
- Multi-column layouts -- text is reordered into reading order.
- Encoding normalization -- non-standard font encodings are mapped to Unicode.
```rust
pub fn parse_pdf(path: &Path) -> Result<ParsedDocument, ParseError> {
    let doc = pdf::file::File::open(path)?;
    let mut pages = Vec::new();
    let mut full_text = String::new();

    for page_num in 0..doc.num_pages() {
        let page = doc.get_page(page_num)?;
        let text = extract_page_text(&page)?;
        full_text.push_str(&text);
        full_text.push('\n');
        pages.push(PageContent {
            number: page_num + 1,
            text,
        });
    }

    let metadata = extract_pdf_metadata(&doc);

    Ok(ParsedDocument {
        text: full_text.trim().to_string(),
        format: "pdf".into(),
        metadata,
        pages: Some(pages),
        ..Default::default()
    })
}
```
DOCX Parsing
Word documents store content in XML inside a ZIP archive. FLIN extracts text, headings, and basic formatting:
```
result = parse_document("proposal.docx")

// Structured sections
for section in result.sections {
  log_info("## {section.heading}")
  log_info(section.text)
}

// Tables
for table in result.tables {
  for row in table.rows {
    log_info(row.join(" | "))
  }
}
```
The parser preserves document structure:

- Headings are extracted with their level (H1, H2, H3).
- Paragraphs are separated by newlines.
- Lists are extracted with bullet/number prefixes.
- Tables are extracted as arrays of arrays.
- Images are noted but not OCR-processed (text content only).
```rust
pub fn parse_docx(path: &Path) -> Result<ParsedDocument, ParseError> {
    let file = std::fs::File::open(path)?;
    let mut archive = zip::ZipArchive::new(file)?;

    let document_xml = archive.by_name("word/document.xml")?;
    let doc = parse_xml(document_xml)?;

    let mut sections = Vec::new();
    let mut current_heading = String::new();
    let mut current_text = String::new();
    let mut full_text = String::new();

    for element in doc.body.children {
        match element {
            Element::Paragraph(p) if p.is_heading() => {
                if !current_text.is_empty() {
                    sections.push(Section {
                        heading: current_heading.clone(),
                        text: current_text.trim().to_string(),
                    });
                }
                current_heading = p.text();
                current_text = String::new();
                full_text.push_str(&format!("\n## {}\n", current_heading));
            }
            Element::Paragraph(p) => {
                let text = p.text();
                current_text.push_str(&text);
                current_text.push('\n');
                full_text.push_str(&text);
                full_text.push('\n');
            }
            Element::Table(t) => {
                let table_text = format_table(&t);
                current_text.push_str(&table_text);
                full_text.push_str(&table_text);
            }
            _ => {}
        }
    }

    // Flush the final section accumulated after the last heading
    if !current_text.is_empty() {
        sections.push(Section {
            heading: current_heading,
            text: current_text.trim().to_string(),
        });
    }

    Ok(ParsedDocument {
        text: full_text.trim().to_string(),
        format: "docx".into(),
        sections: Some(sections),
        ..Default::default()
    })
}
```
CSV Parsing
CSV files are parsed into structured rows with automatic header detection:
```
result = parse_document("products.csv")

// Access as structured rows
for row in result.rows {
  log_info("{row.name}: ${row.price}")
}

// Full text (all rows concatenated)
full_text = result.text

// Headers
headers = result.metadata.columns   // ["name", "price", "category", ...]
row_count = result.metadata.rows    // 1523
```
The CSV parser handles:

- Header detection -- the first row is used as column names.
- Quoted fields -- fields containing commas or newlines within quotes.
- Encoding detection -- UTF-8, UTF-16, and Latin-1 are auto-detected.
- Delimiter detection -- commas, semicolons, tabs, and pipes.
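Delimiter detection usually works by sampling the first few lines, counting each candidate delimiter, and preferring the one that appears consistently on every line. A Python sketch of that idea (an approximation, not FLIN's actual implementation):

```python
def detect_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Pick the delimiter with the highest, most consistent per-line count."""
    lines = [l for l in sample.splitlines() if l.strip()][:10]
    best, best_score = ",", -1.0
    for d in candidates:
        counts = [line.count(d) for line in lines]
        if not counts or min(counts) == 0:
            continue  # a real delimiter appears on every sampled line
        # Equal counts on every line strongly suggest a tabular delimiter
        consistency = 1.0 if len(set(counts)) == 1 else 0.5
        score = consistency * sum(counts) / len(counts)
        if score > best_score:
            best, best_score = d, score
    return best

print(detect_delimiter("name;price\nwidget;9.99\ngadget;12.50"))  # ;
```

The consistency bonus matters: a comma inside one quoted field should not outvote a semicolon that cleanly separates every row.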
JSON and YAML Parsing
Structured data formats are parsed and flattened into text for embedding:
```
// JSON
result = parse_document("config.json")
data = result.data   // Parsed JSON as FLIN value

// YAML
result = parse_document("playbook.yaml")
data = result.data   // Parsed YAML as FLIN value

// Both provide a text representation for embedding
full_text = result.text   // Flattened key-value text
```
The text representation flattens nested structures into readable key-value pairs:
```
{"user": {"name": "Thales", "email": "[email protected]", "roles": ["admin", "dev"]}}
```

Becomes:

```
user.name: Thales
user.email: [email protected]
user.roles: admin, dev
```

This flattened format produces better embeddings than the raw JSON because it is closer to natural language.
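The flattening rule is straightforward to sketch: walk the structure recursively, join nested keys with dots, and join scalar lists with commas. A Python approximation (not FLIN's actual code):

```python
import json

def flatten(value, prefix=""):
    """Flatten nested dicts/lists into 'dotted.key: value' lines."""
    lines = []
    if isinstance(value, dict):
        for key, val in value.items():
            path = f"{prefix}.{key}" if prefix else key
            lines.extend(flatten(val, path))
    elif isinstance(value, list) and all(not isinstance(v, (dict, list)) for v in value):
        # Lists of scalars collapse to a comma-separated line
        lines.append(f"{prefix}: {', '.join(str(v) for v in value)}")
    elif isinstance(value, list):
        # Lists of objects get positional keys
        for i, val in enumerate(value):
            lines.extend(flatten(val, f"{prefix}.{i}"))
    else:
        lines.append(f"{prefix}: {value}")
    return lines

doc = json.loads('{"user": {"name": "Thales", "roles": ["admin", "dev"]}}')
print("\n".join(flatten(doc)))
# user.name: Thales
# user.roles: admin, dev
```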
The Document Ingestion Pipeline
Parsing is typically the first step in a document ingestion pipeline:
```
// app/api/documents/upload.flin
guard auth
guard role("admin")

route POST {
  validate {
    file: file @required @max_size("50MB") @allow_types(
      "application/pdf",
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "text/csv",
      "application/json",
      "text/yaml"
    )
    title: text @required
    category: text
  }

  // Store the file
  file_path = save_file(body.file, ".flindb/documents/")

  // Parse the document
  parsed = parse_document(file_path)

  // Chunk for embedding
  chunks = chunk_text(parsed.text, {
    max_size: 500,
    overlap: 50,
    strategy: "paragraph"
  })

  // Save document metadata
  doc = Document {
    title: body.title,
    file_path: file_path,
    format: parsed.format,
    page_count: parsed.metadata.pages || 0,
    category: body.category || "general",
    full_text: parsed.text
  }
  save doc

  // Save chunks with embeddings
  for chunk in chunks {
    save DocumentChunk {
      document_id: doc.id,
      content: chunk.text,   // semantic text
      page: chunk.page || 0,
      position: chunk.position
    }
  }

  response {
    status: 201
    body: {
      id: doc.id,
      title: doc.title,
      format: parsed.format,
      chunks: chunks.len,
      pages: parsed.metadata.pages || 0
    }
  }
}
```
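The chunk_text call in the pipeline packs paragraphs into size-bounded chunks with a small overlap between neighbors. A rough Python equivalent of those assumed semantics (not FLIN's implementation):

```python
def chunk_text(text: str, max_size: int = 500, overlap: int = 50) -> list[str]:
    """Greedily pack paragraphs into chunks of at most max_size characters,
    carrying the last `overlap` characters into the next chunk for context."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_size:
            chunks.append(current)
            current = current[-overlap:]  # overlap tail seeds the next chunk
        current = f"{current}\n\n{para}".strip() if current else para
    if current:
        chunks.append(current)
    return chunks
```

The overlap means context that straddles a chunk boundary still appears, at least partially, in both chunks, which reduces retrieval misses at boundaries.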
Upload a file. Parse it. Chunk it. Embed and index the chunks. Search across all uploaded documents. The entire pipeline is 40 lines of FLIN code.
Error Handling for Malformed Documents
Real-world documents are frequently malformed: password-protected PDFs, corrupted DOCX files, CSV files with inconsistent column counts. The parser handles these gracefully:
```
result = parse_document("maybe-corrupted.pdf")

if result.error != none {
  log_warn("Parse error: {result.error}")

  // Fall back to raw text extraction
  result = parse_document("maybe-corrupted.pdf", { fallback: "raw" })
}
```
The fallback: "raw" option extracts whatever text can be found without strict format parsing. For damaged PDFs, this often recovers most of the readable content.
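A raw fallback can be as simple as scanning the file's bytes for runs of printable text, skipping binary structure entirely. This Python sketch is an assumption about how such a fallback might work, not FLIN's code:

```python
import re

def raw_text_fallback(data: bytes, min_run: int = 4) -> str:
    """Recover readable ASCII runs from a damaged file's raw bytes."""
    decoded = data.decode("latin-1")  # lossless byte-to-char mapping
    # Keep runs of printable ASCII at least min_run characters long
    runs = re.findall(rf"[ -~]{{{min_run},}}", decoded)
    return "\n".join(runs)
```

Short runs are dropped because binary data produces many accidental two- and three-character "words" that would pollute the extracted text.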
Supported Formats Summary
| Format | Extensions | Parser | Notes |
|---|---|---|---|
| PDF | .pdf | Custom Rust | Text, tables, metadata |
| DOCX | .docx | ZIP + XML | Headings, paragraphs, tables |
| CSV | .csv, .tsv | Custom | Auto-detect delimiter |
| JSON | .json | serde_json | Flattened to text |
| YAML | .yaml, .yml | serde_yaml | Flattened to text |
| Plain text | .txt, .md | Direct | No parsing needed |
| HTML | .html, .htm | Tag stripping | Text content extracted |
Format detection uses the file extension first, then falls back to content-type sniffing for ambiguous cases.
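Extension-first detection with a magic-byte fallback might look like the following Python sketch (the magic numbers shown are the standard ones; the function itself is illustrative):

```python
import os

MAGIC_BYTES = {
    b"%PDF": "pdf",
    b"PK\x03\x04": "docx",  # ZIP container; DOCX is ZIP + XML
    b"{": "json",
}

KNOWN_EXTENSIONS = {"pdf", "docx", "csv", "tsv", "json",
                    "yaml", "yml", "txt", "md", "html", "htm"}

def detect_format(path: str, head: bytes) -> str:
    """Prefer the file extension; fall back to sniffing the leading bytes."""
    ext = os.path.splitext(path)[1].lower().lstrip(".")
    if ext in KNOWN_EXTENSIONS:
        return ext
    for magic, fmt in MAGIC_BYTES.items():
        if head.startswith(magic):
            return fmt
    return "txt"  # treat unknown content as plain text
```

Sniffing only runs for ambiguous cases, so a correctly named file never pays the cost of reading and matching magic bytes.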
Document parsing is the unglamorous foundation of any RAG system. Without clean, structured text extraction, embeddings are poor and search results are irrelevant. FLIN's built-in parsers ensure that the text entering the embedding pipeline is clean, structured, and faithful to the original document.
In the next article, we explore code-aware chunking -- how FLIN splits documents into embedding-appropriate chunks while respecting semantic boundaries.
---
This is Part 121 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.
Series Navigation: - [120] RAG: Retrieval, Reranking, and Source Attribution - [121] Document Parsing: PDF, DOCX, CSV, JSON, YAML (you are here) - [122] Code-Aware Chunking for RAG - [123] Hybrid Document Search: BM25 + Semantic