NanoNets/docext
An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)
Powered by vision-language models (VLMs), it converts PDFs and images to markdown with semantic understanding of LaTeX equations, signatures, watermarks, and tables, and supports structured field extraction with confidence scoring via a REST API. The toolkit also includes a benchmarking leaderboard that evaluates VLM performance across OCR, key information extraction, document classification, table extraction, and other document intelligence tasks.
1,871 stars and 268 monthly downloads. No commits in the last 6 months. Available on PyPI.
Stars: 1,871
Forks: 135
Language: Python
License: Apache-2.0
Category:
Last pushed: Aug 25, 2025
Monthly downloads: 268
Commits (30d): 0
Dependencies: 20
Get this data via API
curl "https://pt-edge.onrender.com/api/v1/quality/rag/NanoNets/docext"
Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000/day.
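The same request can be made from Python. The following is a minimal sketch using only the standard library. It assumes the endpoint returns a JSON body and that the path follows the pattern `<category>/<owner>/<repo>` seen in the curl example. Neither is confirmed by this page, and the helper names `quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path copied from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def quality_url(category: str, owner: str, repo: str) -> str:
    """Build the endpoint URL.

    Assumption: the three path segments mean category/owner/repo,
    inferred only from the single example URL on this page.
    """
    parts = [quote(part, safe="") for part in (category, owner, repo)]
    return BASE_URL + "/" + "/".join(parts)


def fetch_quality(category: str, owner: str, repo: str, timeout: float = 10.0):
    """Send an unauthenticated GET and decode the body as JSON.

    Assumption: the response is JSON; its fields are not documented here,
    so the decoded value is returned as-is.
    """
    url = quality_url(category, owner, repo)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Prints the URL only; call fetch_quality(...) to make the request.
    print(quality_url("rag", "NanoNets", "docext"))
```

Because the keyless tier is limited to 100 requests/day, caching the result of `fetch_quality` locally is a reasonable choice when polling several repositories.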
Related tools
PaddlePaddle/PaddleOCR
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR...
kreuzberg-dev/kreuzberg
A polyglot document intelligence framework with a Rust core. Extract text, metadata, and...
yfedoseev/pdf_oxide
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown...
opendataloader-project/opendataloader-pdf
PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.
AKSarav/pdfstract
PDFStract - The Extraction and Chunking Layer in Your RAG Pipeline - Available as CLI - WEBUI - API