NanoNets/docext

An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)

Established

Powered by vision-language models (VLMs), it converts PDFs and images to markdown with semantic understanding of LaTeX equations, signatures, watermarks, and tables. It also supports structured field extraction with confidence scoring through a REST API. The toolkit includes a benchmarking leaderboard that evaluates VLM performance across OCR, key information extraction, document classification, table extraction, and other document intelligence tasks.
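As a rough illustration of how per-field confidence scores might be used downstream, the sketch below splits extracted fields into accepted values and values flagged for manual review. Everything in it is an assumption made for illustration: the `ExtractedField` shape, the `split_by_confidence` helper, the 0.8 threshold, and the sample values are hypothetical and are not taken from docext or its REST API.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    """Hypothetical shape of one extracted field (not docext's actual schema)."""
    name: str
    value: str
    confidence: float  # assumed to lie in the range [0, 1]


def split_by_confidence(fields, threshold=0.8):
    """Separate fields at or above the threshold from those below it."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


# Made-up sample data, for illustration only.
fields = [
    ExtractedField("invoice_number", "INV-001", 0.97),
    ExtractedField("total_amount", "1,250.00", 0.62),
]

accepted, needs_review = split_by_confidence(fields)
print("accepted:", [f.name for f in accepted])
print("needs review:", [f.name for f in needs_review])
```

A threshold like this is a tuning choice; a stricter value sends more fields to review, a looser one accepts more automatically.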

1,871 stars and 268 monthly downloads. No commits in the last 6 months. Available on PyPI.

Maintenance 2 / 25

Adoption 16 / 25

Maturity 18 / 25

Community 19 / 25
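Each of the four component scores is out of 25, so together they span a 100-point scale. The short sketch below adds them up, assuming the overall score is a plain sum of the components; the page does not confirm that this is how the total is derived.

```python
# Component scores as listed on this page, each out of 25.
components = {
    "Maintenance": 2,
    "Adoption": 16,
    "Maturity": 18,
    "Community": 19,
}
MAX_PER_COMPONENT = 25

# Assumption: the overall score is the unweighted sum of the components.
total = sum(components.values())
maximum = MAX_PER_COMPONENT * len(components)

print(f"{total} / {maximum}")  # prints "55 / 100"
```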

Stars: 1,871
Forks: 135
Language: Python
License: Apache-2.0

Related tools

PaddlePaddle/PaddleOCR

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR...

kreuzberg-dev/kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, and...

yfedoseev/pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown...

opendataloader-project/opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

AKSarav/pdfstract

PDFStract - The Extraction and Chunking Layer in Your RAG Pipeline - Available as CLI - WEBUI - API
