NanoNets/docext

An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)

55 / 100 (Established)

Powered by vision-language models, docext converts PDFs and images to markdown while recognizing LaTeX equations, signatures, watermarks, and tables. It also supports structured field extraction with confidence scores through a REST API. The toolkit includes a benchmarking leaderboard that evaluates VLM performance across OCR, key information extraction, document classification, table extraction, and other document intelligence tasks.

1,871 stars and 268 monthly downloads. No commits in the last 6 months. Available on PyPI.

Stale 6m
Maintenance 2 / 25
Adoption 16 / 25
Maturity 18 / 25
Community 19 / 25
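
The four sub-scores above appear to add up to the overall 55 / 100. This is an observation from the numbers on this page, not a documented formula, and the variable names below are illustrative only. A quick check:

```python
# Sub-scores copied from the listing above; each is out of 25.
# Assumption: the overall score is the plain sum of the four sub-scores.
sub_scores = {"Maintenance": 2, "Adoption": 16, "Maturity": 18, "Community": 19}

overall = sum(sub_scores.values())
max_overall = 25 * len(sub_scores)

print(f"{overall} / {max_overall}")  # prints "55 / 100"
```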


Stars: 1,871
Forks: 135
Language: Python
License: Apache-2.0
Last pushed: Aug 25, 2025
Monthly downloads: 268
Commits (30d): 0
Dependencies: 20

Get this data via API

curl "https://pt-edge.onrender.com/api/v1/quality/rag/NanoNets/docext"

Open to everyone: 100 requests/day with no key needed. A free key raises the limit to 1,000 requests/day.
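
The same request can be made from Python. The following is a minimal sketch using only the standard library. It assumes the endpoint returns JSON and that the `rag` path segment is a category name; neither is stated on this page, and the helper names `build_quality_url` and `fetch_quality` are hypothetical.

```python
import json
import urllib.request
from typing import Any
from urllib.parse import quote

# Base path taken from the curl example above.
BASE_URL = "https://pt-edge.onrender.com/api/v1/quality"


def build_quality_url(owner: str, repo: str, category: str = "rag") -> str:
    """Build the URL in the same shape as the curl example.

    Assumption: the segment after /quality/ ("rag") is a category name.
    """
    parts = (quote(category, safe=""), quote(owner, safe=""), quote(repo, safe=""))
    return f"{BASE_URL}/" + "/".join(parts)


def fetch_quality(owner: str, repo: str, category: str = "rag",
                  timeout: float = 10.0) -> Any:
    """Fetch the quality data for one repository.

    Assumption: the response body is JSON; its schema is not documented here,
    so the decoded payload is returned as-is.
    """
    request = urllib.request.Request(
        build_quality_url(owner, repo, category),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_quality("NanoNets", "docext")` would issue the same GET request as the curl command. With no key, the 100 requests/day limit applies, so it is worth caching the result when polling more than one repository.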