opendataloader-project/opendataloader-pdf
PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.
Extracts text, tables, and images with bounding boxes and semantic types, using XY-Cut++ reading-order analysis. Hybrid mode routes complex pages to AI, adds OCR support, and reports 0.90 overall accuracy. SDKs are available for Python, Node.js, and Java, with a LangChain integration for RAG pipelines. The project also powers accessibility automation: it auto-tags untagged PDFs into Tagged PDF format (Q2 2026) in collaboration with the PDF Association and veraPDF, with enterprise PDF/UA export available.
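To illustrate what reading-order analysis over bounding boxes involves, here is a minimal sketch of the classic recursive XY-cut idea. It is not the project's XY-Cut++ implementation. The box layout `(x0, y0, x1, y1)` with y growing downward, the `min_gap` threshold, and the function names are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Assumed box format: (x0, y0, x1, y1), with y growing downward on the page.
Box = Tuple[float, float, float, float]


def _widest_gap(boxes: List[Box], lo: int, hi: int) -> Tuple[float, Optional[float]]:
    """Return (gap width, cut position) for the widest empty band on one axis."""
    ordered = sorted(boxes, key=lambda b: b[lo])
    best_gap, best_cut = 0.0, None
    reach = ordered[0][hi]  # farthest extent covered so far
    for box in ordered[1:]:
        gap = box[lo] - reach
        if gap > best_gap:
            best_gap, best_cut = gap, reach + gap / 2  # cut mid-gap
        reach = max(reach, box[hi])
    return best_gap, best_cut


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[Box]:
    """Order boxes by recursively cutting at the widest whitespace band."""
    if len(boxes) <= 1:
        return list(boxes)
    x_gap, x_cut = _widest_gap(boxes, 0, 2)  # vertical cut: separates columns
    y_gap, y_cut = _widest_gap(boxes, 1, 3)  # horizontal cut: separates rows
    best = max(x_gap, y_gap)
    if best <= 0 or best < min_gap:
        # No usable whitespace band: fall back to top-to-bottom, left-to-right.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    axis, cut = (0, x_cut) if x_gap > y_gap else (1, y_cut)
    first = [b for b in boxes if b[axis] < cut]
    second = [b for b in boxes if b[axis] >= cut]
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)


# A full-width header above two columns, given in shuffled order.
header = (0, 0, 200, 20)
left_top, left_bottom = (0, 40, 90, 70), (0, 80, 90, 110)
right_top, right_bottom = (110, 40, 200, 70), (110, 80, 200, 110)
ordered = xy_cut([right_bottom, left_top, header, right_top, left_bottom])
```

On this layout the sketch reads the header first, then the left column top to bottom, then the right column. Cutting at the single widest gap on each pass is what keeps the two columns intact instead of interleaving their rows.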
1,958 stars. Actively maintained with 136 commits in the last 30 days.
Stars: 1,958
Forks: 135
Language: Java
License: Apache-2.0
Category:
Last pushed: Mar 13, 2026
Commits (30d): 136
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/opendataloader-project/opendataloader-pdf"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
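The same request can be made from Python with the standard library. This is a sketch under assumptions: the path layout `quality/<category>/<owner>/<repo>` is inferred from the single URL shown above, the response is assumed to be JSON, and `quality_url` and `fetch_quality` are hypothetical helper names.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL; the segment order is inferred from the curl example."""
    parts = (quote(part, safe="") for part in (category, owner, repo))
    return f"{BASE_URL}/{'/'.join(parts)}"


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Fetch and decode the response, assuming the API returns JSON."""
    with urlopen(quality_url(category, owner, repo), timeout=timeout) as resp:
        return json.load(resp)


url = quality_url("rag", "opendataloader-project", "opendataloader-pdf")
```

Calling `fetch_quality("rag", "opendataloader-project", "opendataloader-pdf")` performs the network request, which counts against the daily request limit.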
Related tools
PaddlePaddle/PaddleOCR
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR...
kreuzberg-dev/kreuzberg
A polyglot document intelligence framework with a Rust core. Extract text, metadata, and...
yfedoseev/pdf_oxide
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown...
NanoNets/docext
An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking...
AKSarav/pdfstract
PDFStract - The Extraction and Chunking Layer in Your RAG Pipeline - Available as CLI - WEBUI - API