opendataloader-project/opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

Score: 68 / 100 (Established)

Extracts text, tables, and images with bounding boxes and semantic types, using XY-Cut++ reading-order analysis. A hybrid mode routes complex pages to AI, adding OCR support and reaching a reported overall accuracy of 0.90. SDKs are available for Python, Node.js, and Java, with a LangChain integration for RAG pipelines. The project also powers accessibility automation: it auto-tags untagged PDFs into Tagged PDF format (Q2 2026) in collaboration with the PDF Association and veraPDF, and an enterprise PDF/UA export is available.
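To give a feel for what reading-order analysis over bounding boxes involves, here is a minimal sketch of the classic recursive XY-cut idea. It is not XY-Cut++ and not this project's implementation. The `Box` layout, the function names, and the choice to try horizontal cuts before vertical ones are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical box layout: (x0, y0, x1, y1) with the origin at the top-left.
Box = Tuple[float, float, float, float]


def _split(boxes: List[Box], axis: int) -> List[List[Box]]:
    """Group boxes that are separated by empty gaps along one axis.

    axis=0 projects onto x (vertical cuts); axis=1 projects onto y (horizontal cuts).
    """
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]  # furthest extent covered so far
    for box in ordered[1:]:
        if box[axis] > reach:  # gap found: no earlier box spans this position
            groups.append([box])
        else:
            groups[-1].append(box)
        reach = max(reach, box[axis + 2])
    return groups


def xy_cut(boxes: List[Box]) -> List[Box]:
    """Return boxes in an approximate reading order using recursive XY-cut."""
    if len(boxes) <= 1:
        return list(boxes)
    # Illustrative choice: try horizontal cuts (rows) first, then vertical (columns).
    for axis in (1, 0):
        groups = _split(boxes, axis)
        if len(groups) > 1:
            return [b for group in groups for b in xy_cut(group)]
    # No clean cut exists: fall back to top-to-bottom, then left-to-right.
    return sorted(boxes, key=lambda b: (b[1], b[0]))
```

On a page with a full-width title above two columns, this sketch cuts the title off first, then separates the columns, then orders the blocks inside each column. Trying rows first can interleave columns whose paragraph gaps happen to line up, which is one of the weaknesses that refined variants aim to address.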

1,958 stars. Actively maintained with 136 commits in the last 30 days.

No Package · No Dependents
Maintenance 25 / 25
Adoption 10 / 25
Maturity 15 / 25
Community 18 / 25


Stars: 1,958
Forks: 135
Language: Java
License: Apache-2.0
Last pushed: Mar 13, 2026
Commits (30d): 136

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/opendataloader-project/opendataloader-pdf"

Open to everyone: 100 requests/day with no key needed, or 1,000 requests/day with a free key.
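The same request can be made from a script. The sketch below uses only the Python standard library. The URL layout (category, then owner, then repository) is inferred from the single curl example above, the helper names `build_url` and `fetch_quality` are hypothetical, and the response is assumed, without confirmation, to be JSON.

```python
import json
import urllib.request
from urllib.parse import quote

BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_url(owner: str, repo: str, category: str = "rag") -> str:
    """Build the endpoint URL; the path layout is inferred from the one curl example."""
    parts = (quote(part, safe="") for part in (category, owner, repo))
    return f"{BASE_URL}/" + "/".join(parts)


def fetch_quality(owner: str, repo: str, category: str = "rag", timeout: float = 10.0):
    """Fetch the quality record, assuming (unverified) that the body is JSON."""
    request = urllib.request.Request(
        build_url(owner, repo, category),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("opendataloader-project", "opendataloader-pdf")` would issue the same GET request as the curl command. With the keyless limit of 100 requests/day, it is worth caching results rather than polling.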