The result is a deterministic, GPU-free parser with XY-Cut++ reading order reconstruction, bounding-box-annotated JSON output, and a Hybrid Mode that routes complex pages to AI backends while keeping simple pages on the fast local path at 0.015 seconds per page.

SnackOnAI Engineering | Senior AI Systems Researcher | Technical Deep Dive | June 20, 2026

A PDF file is not a document. It is a set of graphics instructions that tell a renderer where to draw characters, lines, and images on a page. There is no concept of "paragraph," "heading," or "table" in the PDF format itself. When you call PyMuPDF or pdfminer and get a string back, you are getting the text in the order those drawing instructions happened to appear, which is not necessarily the reading order, and is definitely not the semantic structure.

This creates a compounding problem for RAG pipelines. You chunk the string. The chunk boundary falls in the middle of a table. You retrieve the chunk. The LLM gets half a table, no headers, and no context about what section it came from. Your retrieval accuracy suffers not because your embedding model is bad but because your document parser threw away the structure that would have made chunking meaningful.
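To make the failure concrete, here is a minimal sketch of fixed-size character chunking applied to a short extracted string. The sample text and the 70-character chunk size are made up for illustration; real pipelines use larger chunks, but the boundary problem is the same.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical extracted text: a heading followed by a small Markdown table.
doc = (
    "## Revenue by Quarter\n"
    "| Quarter | Revenue |\n"
    "|---------|---------|\n"
    "| Q1 2026 | $4.2B   |\n"
    "| Q2 2026 | $4.6B   |\n"
)

chunks = fixed_size_chunks(doc, size=70)
header_row = "| Quarter | Revenue |"

# Chunks that hold table values but not the header row have lost their column labels.
orphaned = [c for c in chunks if "$" in c and header_row not in c]

for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk)
print(f"chunks with values but no headers: {len(orphaned)}")
```

The second chunk contains both revenue figures but neither the heading nor the column names, which is exactly the "half a table" the LLM ends up seeing.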

OpenDataLoader PDF (Apache 2.0) addresses this at the parsing layer rather than the retrieval layer. Its core contribution is a Java-based pipeline that transforms raw PDF drawing instructions into a semantic object tree, reconstructs reading order using XY-Cut++, detects structural elements (headings, tables, lists, figures, captions), and outputs the result as either structure-annotated Markdown ready for chunking, JSON with per-element bounding boxes for source citation, or Tagged PDF for accessibility compliance. For pages too complex for the deterministic path (borderless tables, scanned content), a Hybrid Mode routes them to configurable AI backends without touching the simple pages.

The academic context: IBM's DocLayNet dataset (arXiv:2206.01062) defines 11 document element classes across 80,863 annotated pages. Microsoft's LayoutLM family (arXiv:1912.13318, arXiv:2204.08387) demonstrated that combining text tokens with 2D positional embeddings during pretraining yields significant gains on document understanding tasks (the large LayoutLMv3 model reports 92.08 F1 on FUNSD and 97.46 on CORD). OpenDataLoader sits at the practical intersection of this research: it applies the structural understanding insight deterministically in production, without requiring a fine-tuned foundation model for the common case.

Scope: the core Java pipeline architecture, XY-Cut++ reading order reconstruction, JSON output with bounding boxes, Hybrid Mode routing logic, Tagged PDF output for accessibility, and LangChain integration. Not covered: the enterprise PDF/UA compliance add-on, or the detailed internals of each AI backend in Hybrid Mode.

What It Actually Does

OpenDataLoader PDF converts PDFs to three primary AI-ready formats:

  • JSON with bounding boxes: each detected element has type, page number, text content, and (x1, y1, x2, y2) bounding box coordinates. Enables source citation that points back to the exact page region.

  • Markdown: heading hierarchy preserved, tables rendered as Markdown tables, lists maintained as list items, reading order correct for multi-column layouts.

  • Tagged PDF: accessibility output. An untagged PDF goes in and a tagged PDF comes out, under Apache 2.0.

Installation:

# Python (requires Java 11+)
pip install opendataloader-pdf

# Node.js
npm install @opendataloader/pdf

# Java (direct)
# Maven/Gradle dependency: org.opendataloader:opendataloader-pdf-core

The JVM process boundary is the key operational fact. Each convert() call spawns a new JVM process. From Python and Node.js, this adds approximately 1-2 seconds of startup overhead per invocation, independent of document size. For large batches, the better pattern is to invoke the CLI or use the Java API directly, so that JVM startup is amortized across documents rather than paid per call. For interactive single-document use, the overhead is acceptable. This is not a library to call from a tight, latency-critical Python loop that processes one page at a time.

The Architecture, Unpacked

Focus on the triage layer. The design decision to route per-page rather than per-document is what makes the hybrid mode practical: a 100-page report with 90 text pages and 10 complex tables gets 90 pages at 0.015s/page locally and only 10 pages sent to an AI backend. The cost and latency of AI-assisted parsing are proportional to actual complexity, not document size.
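The arithmetic behind that claim can be sketched in a few lines. The 0.015s/page local figure comes from this article; the 1.5s/page backend latency is an assumption picked only for illustration, and the model ignores concurrency and JVM startup.

```python
LOCAL_SECONDS_PER_PAGE = 0.015  # local-path figure quoted in this article


def estimate_latency(total_pages: int, complex_pages: int,
                     backend_seconds_per_page: float) -> tuple[float, float]:
    """Return (per_page_routing, whole_document_routing) latency in seconds.

    Assumes pages are processed one after another.
    """
    simple_pages = total_pages - complex_pages
    per_page = (simple_pages * LOCAL_SECONDS_PER_PAGE
                + complex_pages * backend_seconds_per_page)
    whole_document = total_pages * backend_seconds_per_page
    return per_page, whole_document


# 100-page report, 10 complex pages, hypothetical 1.5 s/page AI backend.
per_page, whole_document = estimate_latency(100, 10, backend_seconds_per_page=1.5)
print(f"per-page routing:       {per_page:.2f} s")
print(f"whole-document routing: {whole_document:.2f} s")
```

Under these assumptions the backend dominates either way, but per-page routing pays for 10 backend pages instead of 100.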

The Code, Annotated

Snippet One: Python API, Basic Extraction and JSON with Bounding Boxes

# OpenDataLoader PDF: structured extraction for RAG pipelines
# Source: opendataloader-project/opendataloader-pdf (Apache 2.0)
# Shows the three main output patterns for AI/RAG use cases

import json
from pathlib import Path
import opendataloader_pdf as odl

# ─── PATTERN 1: Markdown for chunking ──────────────────────────────────────────
# ← For RAG, Markdown is the right chunking input because:
#   - Heading hierarchy preserved (#, ##, ###)
#   - Tables formatted as Markdown tables (not collapsed to strings)
#   - Reading order correct for multi-column layouts (XY-Cut++)
#   - List items preserved as list structure, not concatenated

result_md = odl.convert(
    "financial_report_q1.pdf",
    format="markdown",
    # ← NO gpu required. Runs deterministically locally.
    #   This is the key difference from ML-based parsers:
    #   reproducible output, no model version drift
)
print(result_md.text)  # Markdown string with full structure preserved


# ─── PATTERN 2: JSON with bounding boxes for source citation ───────────────────
# ← This is the right format when you need to cite sources in RAG responses.
#   Each element has coordinates pointing back to the exact PDF region.
#   Without this, RAG systems can only say "from page 3" at best.

result_json = odl.convert(
    "financial_report_q1.pdf",
    format="json",
)
elements = json.loads(result_json.text)

# ← THIS is the trick: bounding boxes enable citation at the element level
#   You can construct a source reference like:
#   "Section 3.2, page 7, [left: 72, top: 312, right: 540, bottom: 428]"
#   and use that to render a highlighted PDF for the user
for element in elements["elements"]:
    print(f"Type: {element['type']}")          # heading, table, list_item, etc.
    print(f"Page: {element['page']}")          # 1-indexed page number
    print(f"Text: {element['text'][:100]}")    # content
    print(f"BBox: {element['bbox']}")          # {x1, y1, x2, y2}
    print()

# Example JSON output for a table element:
# {
#   "type": "table",
#   "page": 7,
#   "text": "| Quarter | Revenue | YoY % |\n|---------|---------|-------|\n...",
#   "bbox": {"x1": 72, "y1": 312, "x2": 540, "y2": 428},
#   "rows": [["Quarter", "Revenue", "YoY %"], ["Q1 2026", "$4.2B", "+12%"], ...],
#   "reading_order": 15   # position in document reading order
# }


# ─── PATTERN 3: Hybrid Mode for complex documents ─────────────────────────────
# ← Use when your documents have:
#   - Borderless tables (no visible grid lines, relies on text alignment)
#   - Scanned pages (requires OCR)
#   - Dense mathematical content or formulas
#   - Low-quality or rotated scans

result_hybrid = odl.convert(
    "scanned_contract.pdf",
    format="json",
    hybrid={
        "backend": "azure",      # or: docling, hancom, google
        "triage": "auto",        # ← auto: AI only for complex pages
        # "triage": "full",      # ← full: all pages go to AI backend
    }
)
# With triage="auto":
#   Simple text pages: ~0.015s/page (Java local)
#   Complex/scanned pages: AI backend latency (varies by backend)
# The merged output is identical in format regardless of which path each page took.

The triage="auto" setting is the operational key. For a 100-page document where 85 pages are standard text and 15 are scanned or use borderless tables, only those 15 pages incur AI backend latency. The 85 simple pages run at 0.015s each, completing in ~1.3 seconds locally while the complex pages are processed concurrently by the backend.

Snippet Two: LangChain Integration and RAG Pipeline

# OpenDataLoader PDF: LangChain integration for RAG pipelines
# Source: opendataloader-project/opendataloader-pdf docs (Apache 2.0)
# Shows how structure-aware parsing changes chunking strategy

from langchain_text_splitters import MarkdownHeaderTextSplitter  # current package for LangChain splitters
from langchain_community.document_loaders import OpenDataLoaderPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# ─── NAIVE APPROACH (what most teams do) ───────────────────────────────────────
# This collapses everything to text, loses structure, forces blind character chunking
import fitz  # PyMuPDF
doc = fitz.open("technical_spec.pdf")
naive_text = " ".join(page.get_text() for page in doc)
# ← Problems: reading order wrong for multi-column layouts,
#   table cells concatenated into garbage strings,
#   no heading context for any chunk


# ─── STRUCTURE-AWARE APPROACH (OpenDataLoader) ─────────────────────────────────
# ← THIS is the trick: use Markdown output to get heading-aware chunking
#   MarkdownHeaderTextSplitter creates chunks that preserve their section context

loader = OpenDataLoaderPDFLoader(
    file_path="technical_spec.pdf",
    output_format="markdown",  # ← preserves heading hierarchy
)
documents = loader.load()

# Split by Markdown headings: each chunk knows which section it belongs to
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "h1"),    # Document title
        ("##", "h2"),   # Major section
        ("###", "h3"),  # Subsection
    ]
)
chunks = splitter.split_text(documents[0].page_content)

# Each chunk now has metadata:
# chunk.metadata = {"h1": "Technical Specification v3.2", "h2": "Section 4: API Reference"}
# ← This metadata is the section context that makes retrieval answers citable
# ← Without structure-aware parsing, you'd have to infer section membership
#   from proximity to headings, which is unreliable in multi-column layouts

print(f"Chunks created: {len(chunks)}")
# Example output: Chunks created: 47 (the count depends on the document's headings)
# vs naive character splitting: would create an arbitrary number of decontextualized chunks

# ─── BUILD VECTOR STORE ────────────────────────────────────────────────────────
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(chunks, embeddings)

# ─── RETRIEVAL WITH SOURCE CITATION ───────────────────────────────────────────
# For citation, switch to JSON format to get bounding boxes
loader_json = OpenDataLoaderPDFLoader(
    file_path="technical_spec.pdf",
    output_format="json",
)
json_doc = loader_json.load()
elements = json_doc[0].metadata["elements"]

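def find_source_location(text_excerpt: str, elements: list) -> dict:
    """
    Given a retrieved text chunk, find its exact location in the original PDF.
    ← This is what bounding boxes enable: not just "page 7" but
      "page 7, top-left corner, here is the highlighted region"
    Returns {} when no element contains the start of the excerpt.
    """
    probe = text_excerpt[:50].strip()
    if not probe:
        return {}  # an empty probe would match every element
    for el in elements:
        if probe in el.get("text", ""):
            b = el["bbox"]
            return {
                "page": el["page"],
                "bbox": b,
                "element_type": el["type"],
                # Illustrative link format only: "bbox" is not a standard PDF
                # open parameter, so your viewer has to interpret it itself.
                "pdf_highlight_url": (
                    f"document.pdf#page={el['page']}"
                    f"&bbox={b['x1']},{b['y1']},{b['x2']},{b['y2']}"
                ),
            }
    return {}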

The MarkdownHeaderTextSplitter pairing with OpenDataLoader's Markdown output is the composing pattern that makes RAG citation actually work. The alternative, chunking raw text without heading context, produces chunks that may contain the right words but cannot tell the retriever (or the LLM) which section they came from. Structure-aware parsing moves that problem from retrieval to parsing, where it is much easier to solve.
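To see why heading-preserving Markdown matters, here is a dependency-free sketch of heading-aware splitting. It is a simplified stand-in written for this article, not LangChain's implementation; the function name and the sample document are hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into chunks, attaching the active h1/h2/h3 titles to each."""
    chunks, active, buffer = [], {}, []

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"text": text, "metadata": dict(active)})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading invalidates this level and anything nested below it.
            for deeper in range(level, 4):
                active.pop(f"h{deeper}", None)
            active[f"h{level}"] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks


sample = "# Spec\n## API\nUse the endpoint.\n## Limits\n### Rate\n10 rps.\n"
for chunk in split_by_headings(sample):
    print(chunk)
```

Every chunk carries its section path in metadata, which only works if the parser emitted real headings in the first place. Flattened text gives a splitter like this nothing to key on.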

It In Action: End-to-End Worked Example

Input: A 47-page investment research report, two-column layout, with embedded tables showing financial metrics and footnotes at the bottom of each page.

Step 1: Triage

java -jar opendataloader-pdf.jar \
  --input research_report.pdf \
  --format json \
  --hybrid-backend docling \
  --triage auto

Page-by-page triage (logged with --verbose):
  Pages 1-3:   COVER + TOC → LOCAL (text-only, no tables)
  Pages 4-38:  BODY TEXT (2-column) → LOCAL (XY-Cut++ handles column order)
  Pages 39-41: FINANCIAL TABLES → LOCAL (bordered tables, border analysis works)
  Pages 42-44: COMPARISON TABLES (borderless) → HYBRID/docling
  Pages 45-47: APPENDIX → LOCAL

Local pages: 44 (at ~0.015s each) = ~0.66s
Hybrid pages: 3 (docling backend) = ~4.2s concurrently
Total: ~5.0s end-to-end for 47 pages

Step 2: JSON output structure for a financial table

{
  "elements": [
    {
      "type": "table",
      "page": 40,
      "reading_order": 87,
      "text": "| Company | Q1 Revenue | Q2 Revenue | YoY % |\n|---------|-----------|-----------|-------|\n| NVDA    | $22.1B    | $28.7B    | +29.9%|\n| INTC    | $12.7B    | $13.1B    | +3.1% |",
      "bbox": {"x1": 72, "y1": 256, "x2": 540, "y2": 384},
      "rows": [
        ["Company", "Q1 Revenue", "Q2 Revenue", "YoY %"],
        ["NVDA", "$22.1B", "$28.7B", "+29.9%"],
        ["INTC", "$12.7B", "$13.1B", "+3.1%"]
      ],
      "headers": ["Company", "Q1 Revenue", "Q2 Revenue", "YoY %"]
    },
    {
      "type": "heading",
      "page": 39,
      "reading_order": 84,
      "level": 2,
      "text": "Section 4: Revenue Comparison by Company",
      "bbox": {"x1": 72, "y1": 124, "x2": 540, "y2": 148}
    },
    {
      "type": "footnote",
      "page": 40,
      "reading_order": 89,
      "text": "¹ Revenue figures sourced from public SEC filings Q1-Q2 2026.",
      "bbox": {"x1": 72, "y1": 712, "x2": 400, "y2": 726}
    }
  ]
}
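Note that the elements above are not listed in reading order: the heading (84) appears after the table (87). A minimal sketch of how a consumer might use reading_order to recover section context, assuming the field names shown in this example (they are taken from the sample JSON, not verified against the library's schema):

```python
elements = [
    {"type": "table", "page": 40, "reading_order": 87,
     "bbox": {"x1": 72, "y1": 256, "x2": 540, "y2": 384}},
    {"type": "heading", "page": 39, "reading_order": 84, "level": 2,
     "text": "Section 4: Revenue Comparison by Company",
     "bbox": {"x1": 72, "y1": 124, "x2": 540, "y2": 148}},
    {"type": "footnote", "page": 40, "reading_order": 89,
     "text": "¹ Revenue figures sourced from public SEC filings Q1-Q2 2026.",
     "bbox": {"x1": 72, "y1": 712, "x2": 400, "y2": 726}},
]


def attach_sections(elements: list[dict]) -> list[dict]:
    """Sort by reading_order and tag each non-heading element with the last heading seen."""
    current_section = None
    tagged = []
    for el in sorted(elements, key=lambda e: e["reading_order"]):
        if el["type"] == "heading":
            current_section = el["text"]
            continue
        tagged.append({**el, "section": current_section})
    return tagged


def format_citation(el: dict) -> str:
    """Build a human-readable source reference from a tagged element."""
    b = el["bbox"]
    return (f'{el["section"]}, page {el["page"]}, '
            f'bbox ({b["x1"]}, {b["y1"]}, {b["x2"]}, {b["y2"]})')


tagged = attach_sections(elements)
print(format_citation(tagged[0]))
```

The table on page 40 inherits the heading from page 39, so the citation can name the section as well as the region.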

Step 3: Reading order in two-column layout

Naive parser output (character extraction order):
  "Section 4 [column 1 para 1] [column 2 para 1] [column 1 para 2] [column 2 para 2]"
  → Two columns interleaved: incoherent chunks, broken sentences

OpenDataLoader XY-Cut++ output:
  "Section 4 [column 1 para 1] [column 1 para 2] [column 2 para 1] [column 2 para 2]"
  → Each column fully extracted before the next: coherent chunks

XY-Cut++ logic: partition page into vertical and horizontal regions recursively,
  read each partition as a single contiguous text zone before crossing to the
  next zone. This is a deterministic algorithm, no model weights involved.
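The recursive partitioning idea can be sketched with the classic XY-cut algorithm. This is a simplified illustration, not XY-Cut++ and not OpenDataLoader's implementation: the Box type, the 10-unit minimum gap, and the page coordinates are all assumptions, and y is taken to grow downward from the top of the page.

```python
from typing import NamedTuple


class Box(NamedTuple):
    label: str
    x1: float
    y1: float  # top edge (y grows downward in this sketch)
    x2: float
    y2: float  # bottom edge


def find_gap(boxes, lo, hi, min_gap):
    """Midpoint of the first empty band of at least min_gap along one axis, else None."""
    intervals = sorted((lo(b), hi(b)) for b in boxes)
    reach = intervals[0][1]
    for start, end in intervals[1:]:
        if start - reach >= min_gap:
            return (reach + start) / 2
        reach = max(reach, end)
    return None


def xy_cut(boxes, min_gap=10.0):
    """Return boxes in reading order using a simplified recursive XY-cut."""
    if len(boxes) <= 1:
        return list(boxes)
    # Try a vertical cut first (a column gutter), so columns are read one at a time.
    cut = find_gap(boxes, lambda b: b.x1, lambda b: b.x2, min_gap)
    if cut is not None:
        left = [b for b in boxes if b.x2 <= cut]
        right = [b for b in boxes if b.x1 >= cut]
        return xy_cut(left, min_gap) + xy_cut(right, min_gap)
    # Otherwise cut horizontally at the first whitespace band between stacked blocks.
    cut = find_gap(boxes, lambda b: b.y1, lambda b: b.y2, min_gap)
    if cut is not None:
        top = [b for b in boxes if b.y2 <= cut]
        bottom = [b for b in boxes if b.y1 >= cut]
        return xy_cut(top, min_gap) + xy_cut(bottom, min_gap)
    # No clean cut available: fall back to top-to-bottom, left-to-right.
    return sorted(boxes, key=lambda b: (b.y1, b.x1))


# Hypothetical page: a full-width heading above two columns, listed in arbitrary order.
page = [
    Box("col2 para2", 320, 220, 540, 300),
    Box("col1 para1", 72, 100, 290, 200),
    Box("heading", 72, 40, 540, 70),
    Box("col2 para1", 320, 100, 540, 200),
    Box("col1 para2", 72, 220, 290, 300),
]

naive = [b.label for b in sorted(page, key=lambda b: (b.y1, b.x1))]
ordered = [b.label for b in xy_cut(page)]
print("naive: ", naive)
print("xy-cut:", ordered)
```

The full-width heading blocks a vertical cut at the top level, so the page is first split under the heading; the remaining region then splits at the gutter and each column is read to the end before the next one starts.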

Step 4: RAG retrieval result comparison

Query: "What was NVDA's Q2 2026 revenue?"

Naive parser (PyMuPDF text extraction, fixed-size chunking):
  Retrieved chunk: "Q1 Revenue Q2 Revenue YoY % NVDA INTC $22.1B $28.7B +29.9%"
  (column headers on different line from values due to PDF layout order)
  LLM confusion: headers and values interleaved

OpenDataLoader JSON (table as structured rows):
  Retrieved element: table with parsed rows, headers identified separately
  LLM response: "$28.7B, Q2 2026. Source: page 40, Section 4."
  Citation: bbox {x1:72, y1:256, x2:540, y2:384} on page 40

Retrieval improvement: structured table parsing eliminates the most common
  table-in-RAG failure mode without any retrieval-layer changes.
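What "table as structured rows" buys you can be shown directly with the table from Step 2. This is a minimal sketch that assumes the "headers" and "rows" fields look like the sample JSON above; the helper names are hypothetical.

```python
table = {
    "headers": ["Company", "Q1 Revenue", "Q2 Revenue", "YoY %"],
    "rows": [
        ["Company", "Q1 Revenue", "Q2 Revenue", "YoY %"],
        ["NVDA", "$22.1B", "$28.7B", "+29.9%"],
        ["INTC", "$12.7B", "$13.1B", "+3.1%"],
    ],
}


def table_to_records(table: dict) -> list[dict]:
    """Pair each data row with the header names, skipping a repeated header row."""
    headers = table["headers"]
    return [dict(zip(headers, row)) for row in table["rows"] if row != headers]


def rows_as_sentences(table: dict) -> list[str]:
    """Render each row as a self-describing line suitable for embedding."""
    return ["; ".join(f"{key}: {value}" for key, value in record.items())
            for record in table_to_records(table)]


records = table_to_records(table)
nvda = next(r for r in records if r["Company"] == "NVDA")
print(nvda["Q2 Revenue"])
for sentence in rows_as_sentences(table):
    print(sentence)
```

Each value stays attached to its column name, so neither the retriever nor the LLM has to guess which number belongs to which header.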

Why This Design Works, and What It Trades Away

The Java-core-with-thin-wrappers architecture is the correct tradeoff for a document parsing tool. Java's PDF ecosystem is mature (Apache PDFBox is the foundation for much of this tooling), deterministic, and runs on any platform with a JVM without GPU or ML model dependencies. The Python and Node.js wrappers give the library reach into the ecosystems where AI engineers actually work. The build-time JAR embedding (via hatch_build.py and setup.cjs) means the wrappers ship with the parser JAR included, so there is no separate download or build step; the only external requirement is a Java runtime (JVM) on the machine.

XY-Cut++ is the right reading order algorithm for multi-column documents because it makes globally correct decisions about page partitioning rather than locally greedy left-to-right decisions. A naive character-order extractor reads wherever the PDF drawing instruction places the next character. XY-Cut++ first identifies the column structure of the entire page, then reads each column in sequence. The result for two-column academic papers and financial reports is correct reading order without any model training.

The Hybrid Mode's per-page triage is the correct cost engineering decision. Routing entire documents to AI backends for borderless table handling would make the parser expensive and slow for all documents. Routing only the pages that actually need it, determined by the local engine's complexity classifier, keeps the median case fast (0.015s/page locally) while handling edge cases (borderless tables, scanned pages) correctly.
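The routing decision itself can be pictured as a small rule-based function. The sketch below is purely illustrative: the article does not document the classifier's actual inputs, so the PageFeatures fields and the two rules are assumptions derived from the cases named above (scanned pages, borderless tables).

```python
from dataclasses import dataclass


@dataclass
class PageFeatures:
    """Hypothetical per-page signals; not the library's real classifier inputs."""
    number: int
    has_text_layer: bool
    has_borderless_table: bool = False


def route(page: PageFeatures) -> str:
    """Return "local" or "hybrid" for a single page."""
    if not page.has_text_layer:      # scanned page: needs OCR from a backend
        return "hybrid"
    if page.has_borderless_table:    # no ruling lines for border analysis to use
        return "hybrid"
    return "local"


def triage(pages: list[PageFeatures]) -> dict[str, list[int]]:
    """Group page numbers by the path they would take."""
    plan = {"local": [], "hybrid": []}
    for page in pages:
        plan[route(page)].append(page.number)
    return plan


# Mirrors the 47-page worked example: pages 42-44 hold borderless tables.
pages = [PageFeatures(n, has_text_layer=True, has_borderless_table=42 <= n <= 44)
         for n in range(1, 48)]
plan = triage(pages)
print(f"local: {len(plan['local'])} pages, hybrid: {plan['hybrid']}")
```

Because the decision is made per page, adding more simple pages to a document never increases the number of backend calls.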

The Tagged PDF output path is the strategically differentiated capability. Docling, PyMuPDF, pdfminer, and Camelot all output text, Markdown, or JSON. None of them outputs Tagged PDF under an open-source license because writing structure tags back into a PDF file requires either a licensed PDF SDK or a complete open-source PDF writer that understands the Tagged PDF spec. OpenDataLoader is the first open-source tool to close this loop end-to-end.

What OpenDataLoader trades away:

The JVM subprocess overhead is the practical friction that prevents use in tight per-document loops from Python or Node.js. If you are processing 10,000 PDFs in a Python batch job and calling convert() in a loop, you will pay 1-2 seconds of JVM startup for each one, regardless of document size. The correct mitigation is to use the CLI in shell scripts, use the Java API directly in a Java service, or batch all your documents into a single multi-file invocation.
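The scale of that overhead is easy to underestimate. A quick back-of-the-envelope calculation, using only the 1-2 second startup range and the 10,000-document batch mentioned above:

```python
def startup_overhead_seconds(invocations: int, startup_seconds: float) -> float:
    """Total JVM startup time paid across a number of process launches."""
    return invocations * startup_seconds


docs = 10_000
per_call_low = startup_overhead_seconds(docs, 1.0)    # one JVM launch per document
per_call_high = startup_overhead_seconds(docs, 2.0)
single_batch = startup_overhead_seconds(1, 2.0)       # one launch for the whole batch

print(f"convert() in a loop: {per_call_low / 3600:.1f} to {per_call_high / 3600:.1f} hours of JVM startup")
print(f"single invocation:   {single_batch:.0f} seconds of JVM startup")
```

That is roughly 2.8 to 5.6 hours spent launching JVMs before any parsing time is counted, versus a couple of seconds when the startup is paid once.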

The Hybrid Mode AI backends are not bundled: they require separate setup (API keys for Azure or Google, local installation for docling or hancom). The local Java engine handles the common case; the hybrid backends require separate provisioning and incur API costs or hosting overhead.

OCR quality for scanned PDFs depends on which hybrid backend you configure. The local Java engine does not bundle an OCR model. For scanned documents in hybrid mode, OCR quality is that of the chosen backend (docling, Azure AI Document Intelligence, formerly Form Recognizer, or Google Document AI), not of OpenDataLoader itself.

Technical Moats

The reading order reconstruction at the algorithmic layer. XY-Cut++ produces correct multi-column reading order without training data, without a model, and without GPU. ML-based approaches (LayoutLM, LayoutLMv3) learn reading order as a downstream consequence of layout understanding. Deterministic XY-Cut++ gets there directly for the cases where the layout is well-structured (most native PDFs are). IBM's DocLayNet dataset (arXiv:2206.01062) exists precisely because human annotation of layout elements is expensive and layout varies enough across document types to require large labeled datasets. OpenDataLoader's deterministic path avoids this data dependency entirely for native PDFs.

The Tagged PDF output under Apache 2.0. The legal and regulatory moat here is real: PDF accessibility regulations (Section 508 in the US, EN 301 549 in the EU, PDF/UA ISO standards) now apply to a wide range of organizations, and manual PDF remediation at $50-200 per document does not scale. The combination of automated layout analysis and open-source tagged PDF output has not been available before because writing structure tags into PDF requires either a proprietary SDK or a complete open-source PDF writer that the community has not previously built for this purpose.

The LangChain integration with structured Markdown output. Most PDF-to-Markdown converters flatten the document. OpenDataLoader's Markdown output preserves heading hierarchy in a way that directly enables MarkdownHeaderTextSplitter to produce heading-contextualized chunks. This composability with the existing LangChain ecosystem is a practical distribution moat: engineers building RAG pipelines can drop in OpenDataLoader as the first stage with three lines of code and immediately get better chunks.

Insights

Insight One: The performance bottleneck for AI-ready document processing is not the AI part. The bottleneck is reading order reconstruction and table structure recovery, which are deterministic layout analysis problems, not learning problems. LayoutLMv3 achieving 92.08 F1 on FUNSD (arXiv:2204.08387) is impressive, but FUNSD is a form understanding dataset, not a reading order benchmark for multi-column scientific papers or financial reports. For the majority of production PDF documents (native PDFs with standard layouts), deterministic XY-Cut++ at 0.015s/page outperforms any model-based approach on both speed and cost, and produces comparable structural accuracy. The cases where models are genuinely needed, borderless tables, rotated content, complex mixed layouts, are a minority of real production PDF corpora and are handled by the Hybrid Mode.

Insight Two: The PDF accessibility angle is not just a feature addition to a PDF parser. It is a separate product with a different regulatory driver, and it is the reason OpenDataLoader has built the infrastructure to write structure tags back into PDF files rather than just output text or Markdown. PDF/UA compliance is becoming a legal requirement in the EU and US, accessibility lawsuits over inaccessible PDFs are increasing, and the market for PDF remediation services ($50-200 per document) is large. OpenDataLoader's Apache 2.0 Tagged PDF output is the first open-source alternative to these paid remediation services for the auto-tagging use case. This is a separate business logic driver from the RAG use case, and it explains the long-term architectural investment in the PDF writer layer that most other open-source PDF tools have not made.

Surprising Takeaway

OpenDataLoader PDF contains a CLAUDE.md file in its repository root. This is the convention used by Anthropic's Claude Code: a Markdown file at the repository root that tells the coding assistant about the project's structure, conventions, and how to work with the codebase. The presence of this file suggests that the project was developed with AI coding assistance and has been structured to be navigable by AI tools, which is increasingly read as a signal of development process maturity in 2026 open-source projects. The recursion is notable: a tool for processing documents for AI was itself developed with AI coding assistance and uses AI-readable conventions in its own repository.

TL;DR For Engineers

  • OpenDataLoader PDF (opendataloader-project/opendataloader-pdf, Apache 2.0, 8.6k stars) converts PDFs to JSON (with per-element bounding boxes), Markdown (heading-preserving), HTML, Tagged PDF, and plain text. Core is Java 11+, with Python (pip install opendataloader-pdf) and Node.js (npm install @opendataloader/pdf) thin wrappers. Each Python/Node call spawns a JVM process, ~1-2s overhead.

  • XY-Cut++ reading order reconstruction handles multi-column layouts correctly without any model. Table detection uses border analysis + text clustering. Local deterministic path runs at ~0.015s/page with no GPU, no API calls, no data transmission.

  • Hybrid Mode routes per-page to configurable AI backends (docling, hancom, azure, google) via triage="auto" or triage="full". Complex or scanned pages go to AI; simple pages stay local. Cost and latency proportional to actual document complexity.

  • For RAG: use format="markdown" output + MarkdownHeaderTextSplitter (LangChain) for heading-contextualized chunks. Use format="json" to get bounding boxes for source citation at the element level, not just page level.

  • Tagged PDF output (Apache 2.0) is the first open-source end-to-end alternative to $50-200/document manual PDF accessibility remediation. PDF/UA compliance is an enterprise add-on.

The Reading Order Problem Was Always the Parser's Job

The RAG community has spent considerable engineering effort on retrieval algorithms, embedding models, and reranking strategies to improve answer quality. A large fraction of that effort is compensating for what happened upstream: a PDF parser that extracted characters in drawing-instruction order and called it text. OpenDataLoader PDF addresses the problem at its source. Correct reading order, structured table extraction, heading hierarchy preservation, and per-element bounding boxes are document parsing properties. Getting them right at parse time means fewer compensating mechanisms needed downstream, and more transparent failure modes when something does go wrong.

The Tagged PDF pathway is the longer-term signal. Building an open-source PDF writer capable of emitting structure-tagged output is a significant engineering investment that no open-source PDF tool has made before. The payoff is not just a feature for AI pipelines. It is the infrastructure for a different market entirely, which helps explain why the repo has 501 commits and is under active development.

Summary

OpenDataLoader PDF (Apache 2.0, github.com/opendataloader-project/opendataloader-pdf, 8.6k stars) converts PDFs to AI-ready structured formats: JSON with per-element bounding boxes, Markdown with preserved heading hierarchy, HTML, Tagged PDF, and plain text. Its Java core uses XY-Cut++ for multi-column reading order reconstruction and border-analysis-plus-text-clustering for table structure, running deterministically at ~0.015s/page without GPU or API calls. A Hybrid Mode routes per-page to configurable AI backends (docling, hancom, azure, google) only for complex or scanned pages (triage="auto"), making AI-backend latency proportional to actual document complexity. The first open-source tool to generate Tagged PDFs end-to-end under Apache 2.0, it targets both RAG pipeline data preparation (via LangChain integration with bounding-box-annotated JSON) and PDF accessibility automation as a replacement for $50-200/document manual remediation workflows.
