Dicklesworthstone/llm_aided_ocr

Enhances Tesseract OCR output using LLMs (local or API) for error correction, smart chunking, and markdown formatting of scanned PDFs

58 / 100

Established

Processes PDFs through a multi-stage pipeline: PDF-to-image conversion, Tesseract OCR, intelligent sentence-boundary chunking with context overlap, then parallel LLM correction via OpenAI/Anthropic APIs or local GGUF models with `llama_cpp`. Includes token management with dynamic adjustment, duplicate paragraph removal, optional header/footer suppression, and built-in quality assessment comparing original OCR against final output.
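
The chunking and parallel-correction stages described above can be illustrated with a minimal sketch. This is not the project's code: `split_sentences`, `chunk_with_overlap`, `correct_chunk`, and `correct_all` are hypothetical names, the character limit stands in for real token counting, and `correct_chunk` is a placeholder where a call to an LLM API or a local model would go.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_with_overlap(text, max_chars=200, overlap_sentences=1):
    """Group whole sentences into chunks of roughly max_chars characters.

    Every chunk after the first begins with the last `overlap_sentences`
    sentences of the previous chunk, so the corrector sees some context.
    max_chars is an assumed stand-in for a token budget.
    """
    chunks, current = [], []
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the tail of the finished chunk into the next one.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def correct_chunk(chunk):
    """Placeholder for the LLM correction call (assumption, not a real API).

    It only fixes one hard-coded OCR confusion so the sketch stays runnable.
    """
    return chunk.replace("0CR", "OCR")


def correct_all(chunks, max_workers=4):
    """Correct chunks concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(correct_chunk, chunks))


if __name__ == "__main__":
    raw = "This is 0CR output. It has errors. They get fixed per chunk."
    for piece in correct_all(chunk_with_overlap(raw, max_chars=40)):
        print(piece)
```

Because neighbouring chunks share sentences, a real pipeline also has to drop the repeated text when stitching the corrected chunks back together, which is presumably where a duplicate-removal step such as the one mentioned above comes in.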

2,885 stars. Actively maintained with 2 commits in the last 30 days.

No Package · No Dependents

Maintenance 13 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 19 / 25

Stars: 2,885

Forks: 206

Language: Python

License: —

Related tools

NanoNets/docstrange

Extract and convert data from any document, images, pdfs, word doc, ppt or URL into multiple...

hashangit/Extract2MD

Extract2MD is a powerful and versatile AI-enabled client-side JavaScript library for extracting...

th1nhhdk/local_ai_ocr

A local, offline (after initial setup), portable OCR software that can process images and PDF...

emcf/thepipe

Get clean data from tricky documents, powered by vision-language models ⚡

langstruct-ai/langstruct

Extract structured data from any content using LLMs.
