opendataloader-project/opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

68 / 100

Established

Extracts text, tables, and images with bounding boxes and semantic types, using XY-Cut++ reading-order analysis. A hybrid mode routes complex pages to an AI backend and adds OCR support, for a reported 0.90 overall accuracy. SDKs are available for Python, Node.js, and Java, with a LangChain integration for RAG pipelines. The project also powers accessibility automation: it auto-tags untagged PDFs to the Tagged PDF format (Q2 2026) in collaboration with the PDF Association and veraPDF, with enterprise PDF/UA export available.
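To make the reading-order idea concrete, here is a minimal sketch of the classic recursive XY-cut on bounding boxes. It is not XY-Cut++ and not this project's implementation; the `Box` type, the `widest_gap` and `xy_cut` names, and the `min_gap` threshold are all hypothetical and chosen for illustration only. The sketch assumes y grows downward and prefers column splits over row splits.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box, y growing downward (hypothetical type)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def widest_gap(intervals: List[Tuple[float, float]], min_gap: float) -> Optional[float]:
    """Return the midpoint of the widest empty gap of at least min_gap, or None."""
    best_width, best_cut = 0.0, None
    reach = None  # furthest coordinate covered by the intervals seen so far
    for start, end in sorted(intervals):
        if reach is not None:
            gap = start - reach
            if gap >= min_gap and gap > best_width:
                best_width, best_cut = gap, (start + reach) / 2
        reach = end if reach is None else max(reach, end)
    return best_cut


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[Box]:
    """Order boxes by recursively splitting the page at whitespace gaps.

    min_gap must be positive so that every split is non-trivial.
    """
    if len(boxes) <= 1:
        return list(boxes)

    # Try a vertical cut first so that columns are read one after another.
    cut_x = widest_gap([(b.x0, b.x1) for b in boxes], min_gap)
    if cut_x is not None:
        left = [b for b in boxes if b.x1 <= cut_x]
        right = [b for b in boxes if b.x0 >= cut_x]
        return xy_cut(left, min_gap) + xy_cut(right, min_gap)

    # Otherwise try a horizontal cut and read the top part before the bottom.
    cut_y = widest_gap([(b.y0, b.y1) for b in boxes], min_gap)
    if cut_y is not None:
        top = [b for b in boxes if b.y1 <= cut_y]
        bottom = [b for b in boxes if b.y0 >= cut_y]
        return xy_cut(top, min_gap) + xy_cut(bottom, min_gap)

    # No usable gap: fall back to top-to-bottom, left-to-right order.
    return sorted(boxes, key=lambda b: (b.y0, b.x0))
```

On a page with a full-width title above two columns, this sketch first splits the title from the body with a horizontal cut, then splits the body into columns, so the left column is read in full before the right one.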

1,958 stars. Actively maintained with 136 commits in the last 30 days.

No Package

No Dependents

Maintenance 25 / 25

Adoption 10 / 25

Maturity 15 / 25

Community 18 / 25

Stars: 1,958

Forks: 135

Language: Java

License: Apache-2.0

Related tools

PaddlePaddle/PaddleOCR

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR...

kreuzberg-dev/kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, and...

yfedoseev/pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown...

NanoNets/docext

An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking...

AKSarav/pdfstract

PDFStract - The Extraction and Chunking Layer in Your RAG Pipeline - Available as CLI - WEBUI - API
