CatchTheTornado/text-extract-api

Document (PDF, Word, PPTX, ...) extraction and parsing API built on modern OCR engines and Ollama-supported models. It anonymizes documents, removes PII, and converts any document or picture to structured JSON or Markdown.

Established

Built on FastAPI with Celery task queues and Redis caching, it supports pluggable OCR strategies (EasyOCR, MiniCPM-V, Llama 3.2-Vision, and remote services like Marker) that can be swapped based on language and accuracy needs. The system runs entirely self-hosted via Docker, with no cloud dependencies. Optional remote Ollama integration is available for scaling LLM-based post-processing, document structure parsing, table extraction, and PII redaction across diverse document types.
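The idea of swappable OCR strategies can be illustrated with a small registry pattern. This is only a sketch under stated assumptions, not the project's actual code: the names `OCRStrategy`, `register`, `get_strategy`, `EchoStrategy`, and `UpperStrategy` are hypothetical, and the two stand-in backends decode bytes as text instead of running a real OCR engine.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Type


class OCRStrategy(ABC):
    """Hypothetical base class: one subclass per OCR backend."""

    @abstractmethod
    def extract_text(self, data: bytes) -> str:
        """Return the text recognized in the given file contents."""


# Registry mapping a strategy name to its class (illustrative only).
_STRATEGIES: Dict[str, Type[OCRStrategy]] = {}


def register(name: str) -> Callable[[Type[OCRStrategy]], Type[OCRStrategy]]:
    """Class decorator that adds a strategy to the registry under `name`."""
    def decorator(cls: Type[OCRStrategy]) -> Type[OCRStrategy]:
        _STRATEGIES[name] = cls
        return cls
    return decorator


@register("echo")
class EchoStrategy(OCRStrategy):
    """Stand-in backend: decodes bytes as UTF-8 instead of running OCR."""

    def extract_text(self, data: bytes) -> str:
        return data.decode("utf-8", errors="replace")


@register("upper")
class UpperStrategy(OCRStrategy):
    """Second stand-in backend, to show that strategies are interchangeable."""

    def extract_text(self, data: bytes) -> str:
        return data.decode("utf-8", errors="replace").upper()


def get_strategy(name: str) -> OCRStrategy:
    """Look up a strategy by name and return a fresh instance."""
    try:
        return _STRATEGIES[name]()
    except KeyError:
        raise ValueError(f"Unknown OCR strategy: {name!r}") from None


if __name__ == "__main__":
    # The caller picks a backend by name; the calling code stays the same.
    for strategy_name in ("echo", "upper"):
        print(strategy_name, "->", get_strategy(strategy_name).extract_text(b"invoice 42"))
```

With a layout like this, adding a new backend means writing one class and registering it, while request-handling code only ever calls `get_strategy(name).extract_text(...)`.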

2,989 stars.

No Package · No Dependents

Maintenance 6 / 25

Adoption 10 / 25

Maturity 16 / 25

Community 19 / 25

Stars: 2,989

Forks: 252

Language: Python

License: MIT

Related tools

NanoNets/docstrange

Extract and convert data from any document, images, pdfs, word doc, ppt or URL into multiple...

hashangit/Extract2MD

Extract2MD is a powerful and versatile AI-enabled client-side JavaScript library for extracting...

Dicklesworthstone/llm_aided_ocr

Enhances Tesseract OCR output using LLMs (local or API) for error correction, smart chunking,...

th1nhhdk/local_ai_ocr

A local, offline (after initial setup), portable OCR software that can process images and PDF...

emcf/thepipe

Get clean data from tricky documents, powered by vision-language models ⚡
